AJ's blog

November 4, 2007

Robust Operations

Filed under: .NET, .NET Framework, Software Architecture, Software Development, WF — ajdotnet @ 4:26 pm

I guess you don’t want to know about my latest “Vista vs. Windows XP on one machine” experience. Suffice it to say that DOS and FDISK saved my day. And that it reminded me that the next post should be about robustness.

Note: This is part of a series:
1. Talking WF (introduction)
2. Workflow Instance State Management
3. Workflow Communication and Workflow Communication Clarification
4. Hosting WF
5. Talking to the Windows Service
more to come (error treatment).

Last week at the coffee machine…

David Developer: „Hey Mrs. Accounting. That vacation of mine wasn‘t properly included in my salary!“

Alica Accounting: „I know of no vacation. And I really don‘t have time for this. I have to track down Carl Consultant. He went on a vacation that actually was rejected!“

Peter Production (munching a donut): „Oh, fat muft all haff happened laft monf … (gulp) … excuse me, when I had to reboot a few servers. Probably Vacations@SDX tried to access the payroll system when it was down. I told you, didn’t I?”

Alica Accounting: „Yes?“ (like in „What the heck is this guy talking about?“)

Peter Production: „Well, the application should have sent an email about that failure, but Exchange was also down. (slowly walking out) But since Paul Projectmanager agreed, the vacation was listed as granted. (From the corridor) And Carl Consultant’s vacation request somehow vanished. (Gone)“

Alica Accounting: „Who built such a mess?“

David Developer: „Oh, I … *aham* … I really don‘t know…“

People pissed off, money involved, considerable cleanup work required. Do you want to be David Developer in this case? Do you need more motivation to think about robust processing and error handling?

Robust processing (i.e. preventing errors) and error handling are actually two sides of the same coin. What’s needed to alleviate the issues implied in the above conversation is as follows:

Preventing Errors:

  • Calling another application semantically synchronously – and dealing with unavailability (the payroll system, as a synchronous acknowledgement is crucial for the workflow)
  • Calling another application semantically asynchronously (the email system, as email is inherently asynchronous)
  • Doing several things semantically in a transactional fashion – and breaking transaction contexts if expedient.

The reason I stressed the word semantically should become clear in a minute. Please note that we cannot prevent the erroneous behaviour itself (e.g. unavailability), but we can prevent it from becoming a showstopper for our application. Only if that effort fails will we have to deal with the error (which I’ll defer until the next post).

Calling Another Application Synchronously

Many things are done synchronously because it’s the natural way to do them, and because it’s so easy. Some things, on the other hand, have to be done synchronously despite the fact that it’s far from easy. Talking to another application usually falls into the second category.

When talking to another application you’ll have to make sure the call actually gets through. If it does not, retry! (And with this your simple call becomes a loop.) If there is no chance, fail. But don’t fail like „throw Exception and wreck the workflow instance“, fail like „execute this branch of the workflow that puts the workflow in a defined error state, to be mended by some operator and even clearly visible within Vacations@SDX (nothing fancy, a small neon light and a bell will do)“.

If you actually manage to get a call through, be sure it was processed correctly. And correctly does not only mean „no exception thrown“. It also includes „sent back the correct reply“. If you get some unexpected return value, one that you haven’t been told about when you asked for the contract (and thus you cannot know whether it is safe to go on), again fail!

The crux is: Failing to call some other system, a system that may be offline, is no technical issue. It’s something to be expected and it has to be handled accordingly. Applications also evolve (and nobody will bother to tell you, just because you happen to call that WebService) and again, this is to be expected!

Internal WebServices may not need this kind of caution but I would apply this rule to any external application, WebService, whatever API.
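To make the retry loop and the „defined error state“ a bit more tangible, here is a minimal, language-neutral sketch in Python (not WF code). All names in it (`call_with_retry`, `CallOutcome`, `EXPECTED_REPLIES`) are made up for illustration, and the assumption that the payroll system signals unavailability with a `ConnectionError` and success with the reply `"ACCEPTED"` is mine, not part of any real contract.

```python
import time
from enum import Enum


class CallOutcome(Enum):
    CONFIRMED = "confirmed"
    FAILED = "failed"  # defined error state, to be mended by an operator


# Replies the (hypothetical) payroll contract promises; anything else is unexpected.
EXPECTED_REPLIES = {"ACCEPTED"}


def call_with_retry(call, attempts=3, delay_seconds=0.0, sleep=time.sleep):
    """Call an external system; return (outcome, detail) instead of raising."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            reply = call()
        except ConnectionError as error:
            # Unavailability is expected, not exceptional: wait and retry.
            last_error = error
            if attempt < attempts:
                sleep(delay_seconds)
            continue
        if reply in EXPECTED_REPLIES:
            return CallOutcome.CONFIRMED, reply
        # The call got through, but the reply is not part of the contract.
        # Retrying will not help, so fail right away.
        return CallOutcome.FAILED, f"unexpected reply: {reply!r}"
    return CallOutcome.FAILED, f"unavailable after {attempts} attempts: {last_error}"
```

The caller never sees an exception for the expected failure modes. It gets an outcome it can branch on, which is what the workflow needs in order to route the instance into its error branch rather than wrecking it.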

Calling Another Application Asynchronously

Sending an email is an inherently asynchronous operation. If all goes well you talk to the mail gateway and that’s that. There is no way to know whether and when the mail is received. So why even bother whether the gateway accepted the mail right away, as long as it will eventually accept it?

Suppose you just put the email in a certain database table and go on as if you did everything you had to do. No special error handling, no external dependency. Nice and easy.
Suppose there is a Mail Service (e.g. a Windows Service that you just wrote) that regularly picks up mails from said table and tries to send them, one at a time. If the mail gateway was not available it would retry. If there was an error it would notify the admin.

„But how does my application know whether the Mail Service actually did send the email? Don’t I need some kind of feedback?“

What for? Email is unsafe. Even if the gateway accepted it, it might still not reach the recipient. So why bother if something bad happened at the initial step?

Side note: This Mail Service would be „the last mail access to write“, and it would add substantial robustness to sending emails. Advantages include:

  • asynchronous processing and load levelling (even in the case of mass emails)
  • application independence (if you happen to have more than one application in need of sending emails)
  • just one mail gateway account, infrastructure hassle just once.
  • the option of adding additional features (routing incoming reply emails to the respective application, regularly resending unanswered emails, escalation emails, …)

The crux here is: If something already is asynchronous, don’t bother making the call to the system foolproof. Factor it out and let some dedicated service handle the gory details of setting up the connection, retry scenarios, escalation, etc. This pattern also holds for other use cases: a file drop to some network drive, sending an SMS, accessing a printer.
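The „put it in a table and let a service pick it up“ idea can be sketched roughly as follows. This is a minimal illustration in Python with an in-process SQLite table, not the actual Mail Service. The table name `mail_outbox`, the functions `queue_mail` and `process_outbox`, the `send` and `notify_admin` callbacks, and the cap of five attempts before the admin is notified are all assumptions made for the example.

```python
import sqlite3


def create_outbox(connection):
    connection.execute(
        "CREATE TABLE IF NOT EXISTS mail_outbox ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " recipient TEXT NOT NULL,"
        " subject TEXT NOT NULL,"
        " body TEXT NOT NULL,"
        " attempts INTEGER NOT NULL DEFAULT 0,"
        " sent INTEGER NOT NULL DEFAULT 0)"
    )


def queue_mail(connection, recipient, subject, body):
    """Application side: store the mail and carry on. No gateway involved."""
    connection.execute(
        "INSERT INTO mail_outbox (recipient, subject, body) VALUES (?, ?, ?)",
        (recipient, subject, body),
    )
    connection.commit()


def process_outbox(connection, send, notify_admin, max_attempts=5):
    """Mail service side: one polling pass over the unsent mails, one at a time."""
    rows = connection.execute(
        "SELECT id, recipient, subject, body, attempts FROM mail_outbox"
        " WHERE sent = 0 AND attempts < ? ORDER BY id",
        (max_attempts,),
    ).fetchall()
    for mail_id, recipient, subject, body, attempts in rows:
        try:
            send(recipient, subject, body)
        except ConnectionError:
            # Gateway unavailable: leave the mail queued for the next pass.
            connection.execute(
                "UPDATE mail_outbox SET attempts = attempts + 1 WHERE id = ?",
                (mail_id,),
            )
            if attempts + 1 >= max_attempts:
                notify_admin(f"mail {mail_id} to {recipient} could not be sent")
        else:
            connection.execute(
                "UPDATE mail_outbox SET sent = 1 WHERE id = ?", (mail_id,)
            )
        connection.commit()
```

The application only ever calls `queue_mail`, which depends on nothing but its own database. A real service would call `process_outbox` on a timer and pass in a `send` function that talks to the actual gateway.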

Doing Several Things in a Transactional Fashion

Transactions are nothing new. In code you use some kind of begin transaction/end transaction logic; in WF there is a corresponding shape to span a transaction boundary (TransactionScopeActivity). Use it! A transaction scope activity not only spans the contained logic, it also adds a persistence point (see Introduction to Hosting Windows Workflow Foundation).

Enter „informal“ transactions…

Suppose neither the payroll system nor the email gateway supports transactions. The textbook answer for non-transactional systems would be „compensation“. But what if the payroll system needs special permission or treatment to undo an entry? And how do you fetch back an email?

In my experience compensation at this technical level is largely a myth. If anything, compensation is a complicated business process that takes its own workflow to handle.

Obviously the transaction boundaries we assumed do not quite work in these cases. But rather than trying endlessly to make them work, we might take a step back and rethink the state model. Do we have to go directly from „waiting for approval“ to „finished“? „Waiting“ and „finished“ are business states, states that describe interaction with or feedback for some end user. But nobody denies us the option to introduce intermediary states like „granting vacation (processing payroll update)“ and „granting vacation (sending notification emails)“. This would effectively cut the former transactional scope into pieces, thus eliminating the immediate need for ACID transactions or compensation.

What do we gain from this? Should anything happen along the way, the state would remain in one of the intermediate states. It would not enter the „vacation granted“ state and effectively lie to us. And what‘s more, the state would actually tell us what part of the process failed, the payroll system or the email. And it would be obvious, clearly calling out in the list of open vacation requests.

The crux here is: Don‘t try to mirror business transactions to technical transaction boundaries. Don‘t do several fragile things in one step, separate them.
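As a rough illustration of such a state model, here is a small Python sketch (again not WF code). The state names are taken from the text above; everything else, including `grant_vacation`, the `save` callback that stands in for persisting the workflow instance, and the representation of a request as a plain dictionary, is hypothetical.

```python
from enum import Enum


class RequestState(Enum):
    WAITING_FOR_APPROVAL = "waiting for approval"
    GRANTING_PAYROLL = "granting vacation (processing payroll update)"
    GRANTING_EMAIL = "granting vacation (sending notification emails)"
    GRANTED = "vacation granted"


def grant_vacation(request, update_payroll, send_notifications, save):
    """Advance one fragile step at a time; persist the state before each step."""
    if request["state"] is RequestState.GRANTED:
        return request["state"]
    steps = [
        (RequestState.GRANTING_PAYROLL, update_payroll),
        (RequestState.GRANTING_EMAIL, send_notifications),
    ]
    # Resume at the step the request is stuck in, otherwise start at the top.
    states = [state for state, _ in steps]
    start = states.index(request["state"]) if request["state"] in states else 0
    for state, action in steps[start:]:
        request["state"] = state
        save(request)
        try:
            action(request)
        except Exception:
            # Stay in the intermediate state: it names the step that failed.
            return request["state"]
    request["state"] = RequestState.GRANTED
    save(request)
    return request["state"]
```

If Exchange is down, the request stays in „granting vacation (sending notification emails)“ and never claims to be granted. Calling `grant_vacation` again later resumes at that step without touching the payroll system a second time.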

Feeling better? More confident? Sorry to disappoint you, but no matter how hard you try, eventually there will be errors. And how to address them will be covered in the next post.

That’s all for now folks,
