AJ's blog

November 11, 2007

Error Handling is Error Management

Filed under: .NET, .NET Framework, Software Architecture, Software Development, WF — ajdotnet @ 12:21 pm

An error occurred, oh my! Try/catch, anything more to say? You bet!

Note: This is part of a series:
1. Talking WF (introduction)
2. Workflow Instance State Management
3. Workflow Communication and Workflow Communication Clarification
4. Hosting WF
5. Talking to the Windows Service
6. Robust Operations
more to come…

The last post talked about preventing errors, but no matter how hard we try, there will be errors. Thus we need to think about Error Management. This is more than simple exception handling, since our workflow system must support long-running, robust operations, and we cannot afford any loss of data. Remember, this is about MY VACATION, so please make sure this particular workflow does not get lost. All in all, error management includes:

  • Ordinary error handling, i.e. handling of exceptions
  • Restart after mending the bug
  • Provide error feedback

Ordinary Error Handling

OK, let’s start with the try/catch area. Errors are inevitable. No matter how much test code you write, it only proves that you haven’t yet found the cause of the next exception. Therefore, rather than trying to prevent every exception, it is far more important to handle exceptions properly. The magic word here is “last chance exception handler” (LCEH). If everything else fails, the LCEH ensures that the exception gets caught. In a web application you’ll write your LCEH code in Application_Error (the handler for the HttpApplication.Error event); a Windows Service should do this in the main loop of the worker thread; and the workflow engine provides a respective event (WorkflowRuntime.WorkflowTerminated) for workflow instances.

It is not the LCEH’s responsibility to prevent data loss or to keep the business data consistent. That’s what transactions are for. The job of the LCEH is to

  • prevent vanishing workflow instances
  • prevent inconsistent WF instance state
  • prevent unnoticed errors

To accomplish this it should

  • ensure the exception is properly logged (e.g. in the Windows event log)
  • ensure the workflow instance is in a defined state (e.g. set the state to “What the heck?”)
  • ensure the exception is properly announced (e.g. write some additional error information to the workflow instance state table)

Again, this is not about preventing errors; it’s about making them apparent.
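To make the three duties listed above a bit more tangible, here is a minimal, language-neutral sketch in Python. It is not WF code and does not use the WorkflowRuntime API; the names InstanceState, run_with_last_chance_handler and the dictionary standing in for the workflow instance state table are all assumptions made up for illustration.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Dict, Optional

logger = logging.getLogger("workflow.lceh")


@dataclass
class InstanceState:
    """One row of a hypothetical workflow instance state table."""
    instance_id: str
    state: str = "Running"
    error_info: Optional[str] = None


def run_with_last_chance_handler(
    instance_id: str,
    step: Callable[[], None],
    state_table: Dict[str, InstanceState],
) -> InstanceState:
    """Run one piece of work; if it throws, make the failure apparent."""
    row = state_table.setdefault(instance_id, InstanceState(instance_id))
    try:
        step()
        row.state = "Completed"
    except Exception as exc:  # deliberately broad: this is the last chance
        # 1. Log the exception, including the stack trace.
        logger.exception("Workflow instance %s failed", instance_id)
        # 2. Put the instance into a defined error state.
        row.state = "Error"
        # 3. Announce the error by storing details next to the instance.
        row.error_info = f"{type(exc).__name__}: {exc}"
    return row
```

Note what the handler does not do: it makes no attempt to repair business data or to retry. It only makes sure the instance neither vanishes nor fails silently.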

Restart After Fixing the Bug

OK, it happened. The child jumped headlong into the wishing well, it screams and nobody can pretend not to have noticed. Now what?

If you do nothing, the child will scream its head off, but that’s it. Talking about software, this is what usually happens. Talking about children and data this is not enough. Of course you should get the child out, clean it, comfort it, and make sure it cannot fall in again. Or you could leave it where it is and start a new one to fill the gap. With children you’ll probably (… no? … 😉 ) opt for the first choice, with software the second one is usually simpler to implement.

So what do you do if you are the admin and you have a workflow instance in an error state? The cause is diagnosed, the bug is fixed. Actually you could simply go on and continue at the location just before the error happened. Unfortunately it is not quite that easy to do that (time shift wasn’t included in the wishing well example). So we have four options, but none is actually pleasant:

  1. Ignoring the situation. This may actually serve you quite well, as long as errors can be diagnosed and mended by someone manually. But if this is too error prone, if it happens too often, or if you have to adhere to some compliance rules (GAAP, legal issues, etc.) you’ll have to think about a better solution.
  2. Announce the problem. Do nothing about the workflow instance but send an email to all people involved, telling them what happened and asking them to start all over again. And kill the child, it’s not needed anymore and it’s become an annoyance.
  3. Continue the workflow instance. This is what your customer will probably ask you to do. And it is by far the most complicated and error prone option. You’ll have to anticipate any error and provide loopbacks to any regular shape within the workflow.
  4. Start all over. The current workflow instance is flawed, dump it. And start a fresh workflow instance automatically (more or less).

Let’s ignore option 1 from now on; we got here because it’s not sufficient anyway. It goes without saying that any other option needs some kind of UI that shows the admin the invalid workflow instances and offers the means to mend the issue.

Option 2 (telling about the failure) is by far the simplest one and takes the least effort. Go for it if you can. Once the audience of your application becomes bigger and/or errors become more regular, option 2 will no longer be sufficient. Forget about option 3 (continuing the workflow instance). It is not feasible for any non-trivial workflow. Period. Rather, option 4 (re-running the workflow) should be your choice. I’ll dedicate a separate post to this pattern.
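Just to illustrate the idea behind option 4, and without anticipating the pattern from the later post, here is a rough sketch in Python. Everything in it is an assumption: the WorkflowInstance record, the “Dumped” state name and the restart_failed_instance helper are invented for this example, and a real implementation would have to go through the workflow engine and its persistence store.

```python
import uuid
from dataclasses import dataclass
from typing import Dict


@dataclass
class WorkflowInstance:
    """Hypothetical record of a workflow instance and its original input."""
    instance_id: str
    request_data: dict
    state: str = "Running"
    replaced_by: str = ""


def restart_failed_instance(
    failed_id: str, instances: Dict[str, WorkflowInstance]
) -> WorkflowInstance:
    """Dump a failed instance and start a fresh one from the same input."""
    old = instances[failed_id]
    if old.state != "Error":
        raise ValueError(f"Instance {failed_id} is not in an error state")

    # Start over with a copy of the original request data.
    fresh = WorkflowInstance(
        instance_id=str(uuid.uuid4()),
        request_data=dict(old.request_data),
    )
    instances[fresh.instance_id] = fresh

    # Keep the flawed instance for diagnosis, but mark it as dumped
    # and remember which instance took its place.
    old.state = "Dumped"
    old.replaced_by = fresh.instance_id
    return fresh
```

The old instance is kept rather than deleted, so the admin UI can still show what went wrong and which new instance replaced it.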

You may run into a situation where even option 4 does not fulfill the demand, e.g. if it cannot rely on “stale” data gathered before. This may call for a dedicated correction workflow. Try to avoid this situation by falling back to option 2. Some errors simply cannot be mended automatically.

Error Feedback

If your web application fails, it’ll show an error page (more precisely, if your web application fails synchronously). But what about failures that happen asynchronously? Failures that happen during timeout handling (at exactly 2:13 AM)? Of course our LCEH makes sure all the gory details are in the event log or any other log file, ready to be collected by the administrator and analyzed by the next available developer. But what about the end user? What is the substitute for the error page in this case?

The end user certainly needs no error page that pops up when he starts the web application and tells him that (any) one workflow instance {144215AE-28EB-4c36-81F6-7A0F9EC69F76} failed last night. What he needs… well, what I expect if one of my vacation requests failed, is:

  • Some way of identifying the vacation request. In the long list of my vacation requests in the web application it should stand out, e.g. it could be in a certain error state marked red.
  • Some way of understanding what’s going on and what I am supposed to do. E.g. there could be an “info” column in the list of vacation requests, saying something like “There was an error writing the vacation to the payroll system. The administrator has been notified. Since he is on vacation and won’t be back in time you may just forget about your vacation.”
  • Some way of being notified of the problem. Why else would I look into the web application in the first place? Read “email”, and it could tell me exactly the same as the web application does. (No reason to omit either!)

Easier said than done? Well, yes. The challenge here is that I expect a lot to happen under error conditions. What if the error occurred while sending an email? Send another one to tell me that emails cannot be sent?

The reality is that it does not exactly make sense to put that much effort into code that should by all means never run. The realistic approach would be:

  • Make vacation requests (i.e. any data that is subject to asynchronous processing) in temporary and error states clearly distinguishable in the web application. This might mean introducing yet another group of states, the “I’m going to do something and the user should never see the intermediate state because it’s overwritten after I have done my job properly”-states.
  • Make sure somebody gets notified. If there is a danger of losing data or producing inconsistent data, at least try to send an email to the administrator. Or shut down and make him look for the cause.
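The two points above can be sketched roughly as follows, again in Python and again purely as an illustration. The state names, the display_style helper and the notify_admin function are assumptions, not part of the actual application; send_email stands for whatever mail routine is available.

```python
import logging
from enum import Enum
from typing import Callable

logger = logging.getLogger("vacation.errors")


class RequestState(Enum):
    SUBMITTED = "Submitted"
    APPROVED = "Approved"
    # Transitional: normally overwritten once the job has been done properly.
    WRITING_TO_PAYROLL = "Writing to payroll"
    ERROR = "Error"


TRANSITIONAL_STATES = {RequestState.WRITING_TO_PAYROLL}


def display_style(state: RequestState) -> str:
    """Tell the web UI how to render a request in the list."""
    if state is RequestState.ERROR:
        return "error"          # e.g. marked red
    if state in TRANSITIONAL_STATES:
        return "in-progress"    # stands out if it never gets overwritten
    return "normal"


def notify_admin(message: str, send_email: Callable[[str], None]) -> bool:
    """Best effort: try to email the admin, fall back to the log."""
    try:
        send_email(message)
        return True
    except Exception:
        # No second email about the failed email; just make it apparent.
        logger.exception("Could not notify the administrator: %s", message)
        return False
```

The return value of notify_admin leaves the decision to the caller: carry on, or shut down and make the administrator look for the cause.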

To summarize this post in broader terms: It is futile to try to mend every error condition. Rather than trying to join Sisyphus, try to make errors apparent. Entering an error state is far better than insufficiently trying to handle (and effectively obscuring) it.

I didn’t plan that but I think it makes sense to spend the next post on the replay pattern…

PS: Finally a recommended post, quote “Programmers often have a misconception that their software should always work.”

That’s all for now folks,
AJ.NET
