AJ's blog

November 11, 2007

Error Handling is Error Management

Filed under: .NET, .NET Framework, Software Architecture, Software Development, WF — ajdotnet @ 12:21 pm

An error occurred, oh my! Try/catch, anything more to say? You bet!

Note: This is part of a series:
1. Talking WF (introduction)
2. Workflow Instance State Management
3. Workflow Communication and Workflow Communication Clarification
4. Hosting WF
5. Talking to the Windows Service
6. Robust Operations
more to come…

The last post talked about preventing errors, but no matter how hard we try, there will be errors. Thus we need to think about Error Management. This is more than simple exception handling, since our workflow system has to support long-running, robust operations, and we cannot afford any loss of data. Remember, this is about MY VACATION, so please make sure this particular workflow does not get lost. All in all, error management includes:

  • Ordinary error handling, i.e. handling of exceptions
  • Restart after mending the bug
  • Provide error feedback

Ordinary Error Handling

OK, let’s start with the try/catch area. Errors are inevitable. No matter how much test code you write, it only proves that you haven’t found the cause of the next exception yet. Therefore, rather than trying to prevent every exception, it is far more important to handle exceptions properly. The magic word here is „last chance exception handler“ (LCEH). If everything else fails, the LCEH ensures that the exception gets caught. In a web application you’ll write your LCEH code in the application-level error handler (Application_Error in Global.asax); a Windows Service should do this in the main loop of the worker thread; and the workflow engine provides a respective event (WorkflowRuntime.WorkflowTerminated) for workflow instances.

It is not the LCEH’s responsibility to prevent data loss or to keep the business data consistent. That’s what transactions are for. The job of the LCEH is to

  • prevent vanishing workflow instances
  • prevent inconsistent WF instance state
  • prevent unnoticed errors

To accomplish this it should

  • ensure the exception is properly logged (e.g. in the Windows event log)
  • ensure the workflow instance is in a defined state (e.g. set the state to „What the heck?“)
  • ensure the exception is properly announced, e.g. write some additional error information to the workflow instance state table.

Again, this is not about preventing errors; it’s about making them apparent.
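To make the three duties a bit more tangible, here is a minimal sketch in Python. It is not WF code: InstanceState stands in for a row of the workflow instance state table, and all names (run_with_last_chance_handler, the „Error“ state, the error_info column) are assumptions made up for this illustration.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger("workflow.lceh")


@dataclass
class InstanceState:
    """One row of a hypothetical workflow instance state table."""
    instance_id: str
    state: str = "Running"
    error_info: Optional[str] = None


def run_with_last_chance_handler(instance: InstanceState,
                                 body: Callable[[], None]) -> None:
    """Run a workflow body; if anything escapes, make the failure apparent."""
    try:
        body()
        instance.state = "Completed"
    except Exception as exc:
        # 1. Log the exception, including the stack trace.
        logger.exception("Workflow instance %s failed", instance.instance_id)
        # 2. Put the instance into a defined state.
        instance.state = "Error"
        # 3. Announce the error where admins and users will look for it.
        instance.error_info = f"{type(exc).__name__}: {exc}"


def failing_body() -> None:
    raise TimeoutError("payroll system did not answer")


request = InstanceState(instance_id="vacation-42")
run_with_last_chance_handler(request, failing_body)
print(request.state, "|", request.error_info)
```

Note that the handler does not try to repair anything. It only makes sure the instance neither vanishes nor pretends to be healthy.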

Restart After Fixing the Bug

OK, it happened. The child jumped headlong into the wishing well, it screams and nobody can pretend not to have noticed. Now what?

If you do nothing, the child will scream its head off, but that’s it. Talking about software, this is what usually happens. Talking about children and data this is not enough. Of course you should get the child out, clean it, comfort it, and make sure it cannot fall in again. Or you could leave it where it is and start a new one to fill the gap. With children you’ll probably (… no? … 😉 ) opt for the first choice, with software the second one is usually simpler to implement.

So what do you do if you are the admin and you have a workflow instance in an error state? The cause is diagnosed, the bug is fixed. Actually you could simply go on and continue at the location just before the error happened. Unfortunately it is not quite that easy to do that (time shift wasn’t included in the wishing well example). So we have four options, but none is actually pleasant:

  1. Ignoring the situation. This may actually serve you quite well, as long as errors can be diagnosed and mended by someone manually. But if this is too error prone, if it happens too often, or if you have to adhere to some compliance rules (GAAP, legal issues, etc.) you’ll have to think about a better solution.
  2. Announce the problem. Do nothing about the workflow instance but send an email to all people involved, telling them what happened and asking them to start all over again. And kill the child, it’s not needed anymore and it’s become an annoyance.
  3. Continue the workflow instance. This is what your customer will probably ask you to do. And it is by far the most complicated and error prone option. You’ll have to anticipate any error and provide loopbacks to any regular shape within the workflow.
  4. Start all over. The current workflow instance is flawed, dump it. And start a fresh workflow instance automatically (more or less).

Let’s ignore option 1 from now on; we got here because it’s not sufficient anyway. It goes without saying that any other option would need some kind of UI, showing the admin the invalid workflow instances and offering respective means to mend the issue accordingly.

Option 2 (telling about the failure) is by far the simplest one and takes the least effort. Go for it if you can. Once the audience of your application becomes bigger and/or errors become more regular, option 2 will no longer be sufficient. Forget about option 3 (continuing the workflow instance). It is not feasible for any non-trivial workflow. Period. Rather, option 4 (re-running the workflow) should be your choice. I’ll dedicate a separate post to this pattern.

You may run into a situation where even option 4 does not fulfill the demand, e.g. if it cannot rely on “stale” data gathered before. This may call for a dedicated correction workflow. Try to avoid this situation by falling back to option 2. Some errors simply cannot be mended automatically.

Error Feedback

If your web application fails, it’ll show an error page (more precisely: if your web application fails synchronously). But what about failures that happen asynchronously? Failures that happen during timeout handling (at exactly 2:13 AM)? Of course our LCEH makes sure all the gory details are in the event log or any other log file, ready to be collected by the administrator and analyzed by the next available developer. But what about the end user? What is the substitute for the error page in this case?

The end user certainly needs no error page that pops up when he starts the web application and tells him that (any) one workflow instance {144215AE-28EB-4c36-81F6-7A0F9EC69F76} failed last night. What he needs… well, what I expect if one of my vacation requests failed, is:

  • Some way of identifying the vacation request. In the long list of my vacation requests in the web application it should stand out, e.g. it could be in a certain error state marked red.
  • Some way of understanding what’s going on and what I am supposed to do. E.g. there could be an “info” column in the list of vacation requests, saying something like “There was an error writing the vacation to the payroll system. The administrator has been notified. Since he is on vacation and won’t be back in time you may just forget about your vacation.”
  • Some way of being notified of the problem; why else would I look into the web application in the first place? Read “email”, and it could tell me exactly the same thing the web application does. (No reason to omit either!)

Easier said than done? Well, yes. The challenge here is that I expect a lot to happen under error conditions. What if the error occurred while sending an email? Send another one to tell me that emails cannot be sent?

The reality is that it does not exactly make sense to put that much effort in code that should by all means never run. The realistic approach would be:

  • Make vacation requests (i.e. any data that is subject to asynchronous processing) in temporary and error states clearly distinguishable in the web application. This might mean introducing yet another group of states, the “I’m going to do something and the user should never see the intermediate state because it’s overwritten after I have done my job properly”-states.
  • Make sure somebody gets notified. If there is a danger of losing data or producing inconsistent data, at least try to send an email to the administrator. Or shut down and make him look for the cause.

To summarize this post in broader terms: It is futile to try to mend every error condition. Rather than trying to join Sisyphus, try to make errors apparent. Entering an error state is far better than insufficiently trying to handle (and effectively obscuring) it.

I didn’t plan that but I think it makes sense to spend the next post on the replay pattern…

PS: Finally a recommended post, quote “Programmers often have a misconception that their software should always work.”

That’s all for now folks,


November 4, 2007

Robust Operations

Filed under: .NET, .NET Framework, Software Architecture, Software Development, WF — ajdotnet @ 4:26 pm

I guess you don’t want to know about my latest “Vista vs. Windows XP on one machine” experience. Suffice it to say that DOS and FDISK saved my day. And that it reminded me that the next post should be about robustness.

Note: This is part of a series:
1. Talking WF (introduction)
2. Workflow Instance State Management
3. Workflow Communication and Workflow Communication Clarification
4. Hosting WF
5. Talking to the Windows Service
more to come (error treatment).

Last week at the coffee machine…

David Developer: „Hey Mrs. Accounting. That vacation of mine wasn’t properly included in my salary!“

Alica Accounting: „I know of no vacation. And I really don’t have time for this. I have to track down Carl Consultant. He went on a vacation that actually was rejected!“

Peter Production (munching a donut): „Oh, fat muft all haff happened laft monf … (gulp) … excuse me, when I had to reboot a few servers. Probably Vacations@SDX tried to access the payroll system when it was down. I told you, didn’t I?”

Alica Accounting: „Yes?“ (like in „What the heck is this guy talking about?“)

Peter Production: „Well, the application should have sent an email about that failure, but Exchange was also down. (slowly walking out) But since Paul Projectmanager agreed, the vacation was listed as granted. (From the corridor) And Carl Consultant’s vacation request somehow vanished. (Gone)“

Alica Accounting: „Who built such a mess?“

David Developer: „Oh, I … *aham* … I really don’t know…“

People pissed off, money involved, considerable cleanup work required. Do you want to be David Developer in this case? Do you need more motivation to think about robust processing and error handling?

Robust processing (i.e. preventing errors) and error handling are actually two sides of the same coin. What’s needed to alleviate the issues implied in the above conversation is as follows:

Preventing Errors:

  • Calling another application semantically synchronously – and dealing with unavailability (the payroll system, as a synchronous acknowledgement is crucial for the workflow)
  • Calling another application semantically asynchronously (the email system, as email is inherently asynchronous)
  • Doing several things semantically in a transactional fashion – and breaking transaction contexts if expedient.

The reason I stressed the word semantically should become clear in a minute. Please note that we cannot prevent the erroneous behaviour (e.g. unavailability), but we can prevent it from becoming a showstopper for our application. Only if that effort fails will we have to deal with the error (which I’ll defer until the next post).

Calling Another Application Synchronously

Many things are done synchronously because it’s the natural way to do them, and because it’s so easy. Some things, on the other hand, have to be done synchronously despite the fact that it’s far from easy. Talking to another application usually falls into the second category…
When talking to another application you’ll have to make sure the call actually gets through. If it does not, retry! (And with this your simple call becomes a loop.) If there is no chance, fail. But don’t fail like „throw Exception and wreck the workflow instance“; fail like „execute this branch of the workflow that puts the workflow in a defined error state, to be mended by some operator and even clearly visible within Vacations@SDX (nothing fancy, a small neon light and a bell will do)“. If you actually manage to get a call through, be sure it was processed correctly. And correctly does not only mean „no exception thrown“. It also includes „sent back the correct reply“. If you get some unexpected return value, one that you haven’t been told about when you asked for the contract (and thus you cannot know whether it is safe to go on), again: fail!
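As a rough sketch of that loop (in Python, not WF; the reply value „BOOKED“, the state names and all function names are assumptions made up for this example), the call could look like this:

```python
import time
from enum import Enum
from typing import Callable


class CallOutcome(Enum):
    OK = "ok"
    UNAVAILABLE = "unavailable"            # all retries used up
    UNEXPECTED_REPLY = "unexpected reply"  # call got through, answer not in the contract


# Hypothetical contract: the only reply we were told means "processed correctly".
EXPECTED_REPLIES = {"BOOKED"}


def call_with_retry(call: Callable[[], str], attempts: int = 3,
                    delay_seconds: float = 0.0) -> CallOutcome:
    """Call the other system, retry on unavailability, never raise."""
    for attempt in range(1, attempts + 1):
        try:
            reply = call()
        except ConnectionError:
            if attempt < attempts:
                time.sleep(delay_seconds)  # wait before the next try
            continue
        # The call got through; "no exception" is not enough, check the reply.
        return CallOutcome.OK if reply in EXPECTED_REPLIES else CallOutcome.UNEXPECTED_REPLY
    return CallOutcome.UNAVAILABLE


def next_state(outcome: CallOutcome) -> str:
    """Map the outcome to a workflow branch instead of throwing."""
    if outcome is CallOutcome.OK:
        return "payroll updated"
    return f"error: payroll update failed ({outcome.value})"


def make_flaky_payroll(failures: int, reply: str) -> Callable[[], str]:
    """Test double: fails `failures` times, then answers with `reply`."""
    remaining = [failures]

    def call() -> str:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise ConnectionError("payroll system is down")
        return reply

    return call


print(next_state(call_with_retry(make_flaky_payroll(2, "BOOKED"))))
print(next_state(call_with_retry(make_flaky_payroll(5, "BOOKED"))))
print(next_state(call_with_retry(make_flaky_payroll(0, "MAYBE"))))
```

The function returns an outcome rather than raising, so the caller is forced to pick a branch. An unexpected reply is deliberately not retried: the call did get through, we just cannot know whether it is safe to go on.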

The crux is: Failing to call some other system, a system that may be offline, is no technical issue. It’s something to be expected and it has to be handled accordingly. Also, applications evolve (and nobody will bother to tell you, just because you happen to call that WebService), and again, this is to be expected!

Internal WebServices may not need this kind of caution but I would apply this rule to any external application, WebService, whatever API.

Calling Another Application Asynchronously

Sending an email is an inherently asynchronous operation. If all goes well you talk to the mail gateway and that’s that. No way to know whether and when the mail is received. So why even bother if the gateway accepted the mail? As long as it will eventually accept it?

Suppose you just put the email in a certain database table and go on as if you did everything you had to do. No special error handling, no external dependency. Nice and easy.
Suppose there is a Mail Service (e.g. a Windows Service that you just wrote) that regularly picks up mails from said table and tries to send them, one at a time. If the mail gateway was not available it would retry. If there was an error it would notify the admin.

„But how does my application know whether the Mail Service actually did send the email? Don’t I need some kind of feedback?“

What for? Email is unsafe; even if the gateway accepted it, it might still not reach the recipient. So why bother if something bad happened at the initial step?

Side note: This Mail Service would be „the last mail access to write“, and it would add substantial robustness to sending emails. Advantages include:

  • asynchronous processing and load levelling (even in the case of mass emails)
  • application independence (if you happen to have more than one application in need of sending emails)
  • just one mail gateway account, infrastructure hassle just once.
  • the option of adding additional features (routing incoming reply emails to the respective application, regularly resending unanswered emails, escalation emails, …)

The crux here is: If something already is asynchronous, don’t bother making the call to the system foolproof. Factor it out and let some dedicated service handle the gory details of setting up the connection, retry scenarios, escalation, etc. This pattern also holds for other use cases: a file drop to some network drive, sending an SMS, accessing a printer.
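The table-plus-service idea from this section can be sketched as follows. This is a Python illustration using an in-memory SQLite database; the outbox table layout, the status values and the function names are assumptions of mine, and the real thing would be a Windows Service polling a proper database.

```python
import sqlite3
from typing import Callable, List


def create_outbox(conn: sqlite3.Connection) -> None:
    conn.execute("""
        CREATE TABLE outbox (
            id        INTEGER PRIMARY KEY AUTOINCREMENT,
            recipient TEXT NOT NULL,
            subject   TEXT NOT NULL,
            status    TEXT NOT NULL DEFAULT 'pending',
            attempts  INTEGER NOT NULL DEFAULT 0
        )""")


def queue_mail(conn: sqlite3.Connection, recipient: str, subject: str) -> None:
    """All the application does: one insert, no gateway involved."""
    conn.execute("INSERT INTO outbox (recipient, subject) VALUES (?, ?)",
                 (recipient, subject))
    conn.commit()


def process_outbox(conn: sqlite3.Connection,
                   send: Callable[[str, str], None],
                   max_attempts: int = 3) -> List[int]:
    """One polling pass of the mail service.

    Returns the ids of mails it gave up on, so the admin can be notified.
    """
    gave_up = []
    rows = conn.execute("SELECT id, recipient, subject, attempts FROM outbox "
                        "WHERE status = 'pending' ORDER BY id").fetchall()
    for mail_id, recipient, subject, attempts in rows:
        try:
            send(recipient, subject)
            status = "sent"
        except ConnectionError:
            attempts += 1
            # Stay pending for the next pass unless the retries are used up.
            status = "failed" if attempts >= max_attempts else "pending"
            if status == "failed":
                gave_up.append(mail_id)
        conn.execute("UPDATE outbox SET status = ?, attempts = ? WHERE id = ?",
                     (status, attempts, mail_id))
    conn.commit()
    return gave_up


# Demo with a fake gateway that is down during the first pass.
gateway = {"up": False}
delivered = []


def send(recipient: str, subject: str) -> None:
    if not gateway["up"]:
        raise ConnectionError("mail gateway unavailable")
    delivered.append((recipient, subject))


conn = sqlite3.connect(":memory:")
create_outbox(conn)
queue_mail(conn, "david@example.com", "Your vacation was granted")
process_outbox(conn, send)  # gateway down: the mail stays pending
gateway["up"] = True
process_outbox(conn, send)  # gateway back: the mail goes out
print(delivered)
```

The application never sees the gateway being down. From its point of view the insert into the table is the whole job, which is exactly the point.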

Doing Several Things in a Transactional Fashion

Transactions are nothing new. In code you use some kind of begin transaction/end transaction logic; in WF there is a respective shape to span a transaction boundary (TransactionScopeActivity). Use it! A transaction scope activity not only spans the contained logic, it also adds a persistence point (see Introduction to Hosting Windows Workflow Foundation).

Enter „informal“ transactions…

Suppose neither the payroll system nor the email gateway supports transactions. The textbook answer for non-transactional systems would be „compensation“. But what if the payroll system needs special permission or treatment to undo an entry? And how do you fetch back an email?

In my experience compensation at this technical level is largely a myth. If anything, compensation is a complicated business process that takes its own workflow to handle.

Obviously the transaction boundaries we assumed do not quite work in these cases. But rather than trying endlessly to make it work, we might take a step back and rethink the state model. Do we have to go directly from „waiting for approval“ to „finished“? „Waiting“ and „finished“ are business states, states that describe interaction with or feedback for some end user. But nobody denies us the option to introduce intermediary states like „granting vacation (processing payroll update)“ and „granting vacation (sending notification emails)“. This would effectively cut the former transactional scope into pieces, thus eliminating the immediate need for ACID transactions or compensation.

What do we gain from this? Should anything happen along the way, the state would remain in one of the intermediate states. It would not enter the „vacation granted“ state and effectively lie to us. And what‘s more, the state would actually tell us what part of the process failed, the payroll system or the email. And it would be obvious, clearly calling out in the list of open vacation requests.
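Here is a small Python sketch of that state model. The two intermediate state names are the ones from above; the grant_vacation function, the step list and the record_state callback (standing in for „write the state to the instance state table“) are assumptions made up for this illustration.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def grant_vacation(steps: List[Step], record_state: Callable[[str], None]) -> str:
    """Run the fragile steps one by one, each under its own intermediate state.

    Returns the state the request ends up in.
    """
    for intermediate_state, action in steps:
        record_state(intermediate_state)  # recorded BEFORE the fragile call
        try:
            action()
        except Exception:
            # Stay in the intermediate state: it tells us which part failed.
            return intermediate_state
    record_state("vacation granted")
    return "vacation granted"


def update_payroll() -> None:
    pass  # pretend the payroll system accepted the entry


def send_notifications() -> None:
    raise ConnectionError("mail gateway unavailable")


history: List[str] = []
final_state = grant_vacation(
    [("granting vacation (processing payroll update)", update_payroll),
     ("granting vacation (sending notification emails)", send_notifications)],
    history.append,
)
print(final_state)
```

In this run the payroll update succeeds and the email fails, so the request stays in „granting vacation (sending notification emails)“ and never claims to be granted.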

The crux here is: Don’t try to mirror business transactions to technical transaction boundaries. Don’t do several fragile things in one step, separate them.

Feeling better? More confident? Sorry to disappoint you, but no matter how hard you try, eventually there will be errors. And how to address them will be covered in the next post.

That’s all for now folks,

