AJ's blog

December 14, 2007

Workflow Series Recap

Filed under: .NET, .NET Framework, Software Architecture, Software Development, WCF, WF — ajdotnet @ 6:57 pm

This post concludes my little workflow series…

1. Talking WF (introduction)
2. Workflow Instance State Management
3. Workflow Communication and Workflow Communication Clarification
4. Hosting WF
5. Talking to the Windows Service
6. Robust Operations
7. Error handling is Error Management

Remember where we started? WF. And what did I mostly talk about? Asynchronous data updates. Threading issues. Robust Windows Service implementation. Asynchronous error handling. Patterns. Practices. Guidelines.

And what did I rarely talk about? Shapes. Activities. Workflow specifics.

Remember what I wrote in the intro post: „And also very similar in the demand for knowledge of things that are not WF specific but are far from common knowledge for the average developer…“

Why are these things “far from common knowledge”, especially if they are not WF specific?

In my opinion this is because WF did not only introduce workflows. It also introduced asynchronous behavior and reliability demands far more forcibly than any other technology before.

  • We had threading before – but only the odd developer actually embraced it.
  • We had Windows Services before – but only rarely were they employed.
  • We had human workflow and state before – but it was largely hand grown and synchronous state machines.
  • We had demanding applications that could not live with these simplistic notions – but these called for specialized server software like BizTalk anyway.

With WF any developer might have to face all these new demands at once. It’s not WF in itself that is complex, in fact I can hardly imagine a workflow engine more easy to use than WF. It is the architectural consequences, the need for until then somewhat exotic concepts, the complicated asynchronous processing patterns. And the need to master all these demands at once.

Truth be told…

I once worked in a project that had quite amazing characteristics: 6 million frontend transactions, processed to eventually enter the balance sheet, subject to GAAP and quite a set of other legal compliance demands. Of course the software was built using BizTalk, not WF. Back then we did much of what I told you here. We had no choice and we had the budget.

Vacations@SDX does not even handle one vacation request per day on average. Does Vacations@SDX adhere to all the guidelines? Of course it does not. No one would have paid for that amount of fault tolerance just for one single workflow. We had to stay on budget and to meet a deadline; making it foolproof simply was not feasible. (And the hosting part was a learning experience anyway.)

The reality is: What I presented here is in certain parts the 120% solution. (I am a friend of delivering 80% and waiting to see which of the missing 20% causes the most trouble. And 120% is simply 20% waste in any case.) But since this project was meant to have reference character, we designed for the 120%. And with changing demands, new versions, or other applications built on the same principles, we may evolve the framework and the patterns. Gradually and where it hurts most. And in one respect we have accomplished more than we would have with a simple “coding” exercise: We have the architectural patterns (even if not fleshed out in toto) and we have the Windows Service framework implementation.

The pragmatic point of view for you is: Decide for yourself which parts hurt you. If you leave out certain aspects, do it knowingly. And I hope I could present some patterns that will help you address the aspects you can’t leave out.

Anyway, this concludes that little series. It’s been a number of posts, but believe me, this is only where it begins. On the missing list are testing of workflows, workflow design (including choosing between sequential and state machine workflows), and versioning, among others. I wanted to talk about those areas that I came to realize caused the most problems for the people involved in the projects. I hope to have provided some useful hints, even if I got carried away sometimes 😉 .

PS: I know, I promised another post about the replay pattern. But given my current workload and other topics in my blog queue, I decided I should close this series this year. I haven’t forgotten it, and if you want me to prioritize it, drop a comment.

I wish you a peaceful Christmas and a happy new year.

That’s all for now folks,
AJ.NET

November 11, 2007

Error Handling is Error Management

Filed under: .NET, .NET Framework, Software Architecture, Software Development, WF — ajdotnet @ 12:21 pm

An error occurred, oh my! Try/catch, anything more to say? You bet!

Note: This is part of a series:
1. Talking WF (introduction)
2. Workflow Instance State Management
3. Workflow Communication and Workflow Communication Clarification
4. Hosting WF
5. Talking to the Windows Service
6. Robust Operations
more to come…

The last post talked about preventing errors, but no matter how hard we try, there will be errors. Thus we need to think about Error Management. This is more than simple exception handling, since our workflow system shall support long running robust operations, and we cannot afford any loss of data. Remember, this is about MY VACATION, please make sure this particular workflow does not get lost. All in all, error management includes:

  • Ordinary error handling, i.e. handling of exceptions
  • Restart after mending the bug
  • Provide error feedback

Ordinary Error Handling

OK, let’s start with the try/catch area. Errors are inevitable. No matter how much test code you write, it will only prove that you haven’t found the cause of the next exception yet. Therefore, rather than preventing every exception, it is far more important to handle exceptions properly. The magical word here is „last chance exception handler“ (LCEH). If everything else failed, the LCEH ensures that the exception gets caught. In a web application you’ll write your LCEH code in the application’s Error event (Application_Error in global.asax), a Windows Service should do this in the main loop of the worker thread, and the workflow engine provides a respective event (WorkflowRuntime.WorkflowTerminated) for workflow instances.

It is not the LCEH’s responsibility to prevent data loss or to keep the business data consistent. That’s what transactions are for. The job of the LCEH is to

  • prevent vanishing workflow instances
  • prevent inconsistent WF instance state
  • prevent unnoticed errors

To accomplish this it should

  • ensure the exception is properly logged (e.g. within the windows event log)
  • ensure the workflow instance is in a defined state (e.g. set the state to „What the heck?“)
  • ensure the exception is properly announced, e.g. write some additional error information to the workflow instance state table.

Again, this is not about preventing errors; it’s about making them apparent.
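
As a minimal sketch of such an LCEH for workflow instances (WorkflowStateTable is a hypothetical helper around the workflow instance state table from the state management post):

// Last chance exception handler for workflow instances - a minimal sketch.
// WorkflowStateTable is a hypothetical helper around the instance state table.
// (Namespaces: System.Workflow.Runtime, System.Diagnostics)
static class LastChanceHandler
{
    public static void Register(WorkflowRuntime runtime)
    {
        runtime.WorkflowTerminated += OnWorkflowTerminated;
    }

    static void OnWorkflowTerminated(object sender, WorkflowTerminatedEventArgs e)
    {
        Guid id = e.WorkflowInstance.InstanceId;

        // 1. ensure the exception is properly logged
        EventLog.WriteEntry("Vacations@SDX",
            "Workflow " + id + " terminated: " + e.Exception,
            EventLogEntryType.Error);

        // 2. ensure the instance is in a defined state ("What the heck?")
        WorkflowStateTable.SetState(id, "Error");

        // 3. ensure the error is properly announced, e.g. in the state table
        WorkflowStateTable.SetErrorInfo(id, e.Exception.Message);
    }
}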

Restart After Fixing the Bug

OK, it happened. The child jumped headlong into the wishing well, it screams and nobody can pretend not to have noticed. Now what?

If you do nothing, the child will scream its head off, but that’s it. Talking about software, this is what usually happens. Talking about children and data this is not enough. Of course you should get the child out, clean it, comfort it, and make sure it cannot fall in again. Or you could leave it where it is and start a new one to fill the gap. With children you’ll probably (… no? … 😉 ) opt for the first choice, with software the second one is usually simpler to implement.

So what do you do if you are the admin and you have a workflow instance in an error state? The cause is diagnosed, the bug is fixed. Actually you could simply go on and continue at the location just before the error happened. Unfortunately it is not quite that easy to do that (time shift wasn’t included in the wishing well example). So we have four options, but none is actually pleasant:

  1. Ignoring the situation. This may actually serve you quite well, as long as errors can be diagnosed and mended by someone manually. But if this is too error prone, if it happens too often, or if you have to adhere to some compliance rules (GAAP, legal issues, etc.) you’ll have to think about a better solution.
  2. Announce the problem. Do nothing about the workflow instance but send an email to all people involved, telling them what happened and asking them to start all over again. And kill the child, it’s not needed anymore and it’s become an annoyance.
  3. Continue the workflow instance. This is what your customer will probably ask you to do. And it is by far the most complicated and error prone option. You’ll have to anticipate any error and provide loopbacks to any regular shape within the workflow.
  4. Start all over. The current workflow instance is flawed, dump it. And start a fresh workflow instance automatically (more or less).

Let’s ignore option 1 from now on; we got here because it’s not sufficient anyway. It goes without saying that any other option would need some kind of UI, showing the admin the invalid workflow instances and offering respective means to mend the issue accordingly.

Option 2 (telling about the failure) is by far the simplest one and takes the least effort. Go for it if you can. Once the audience of your application becomes bigger and/or errors more regular, option 2 will no longer be sufficient. Forget about option 3 (continuing the workflow instance). It is not feasible for any non-trivial workflow. Period. Rather, option 4 (re-running the workflow) should be your choice. I’ll dedicate a separate post to this pattern.

You may run into a situation where even option 4 does not fulfill the demand, e.g. if it cannot rely on “stale” data gathered before. This may call for a dedicated correction workflow. Try to avoid this situation by falling back to option 2. Some errors simply cannot be mended automatically.

Error Feedback

If your web application fails, it’ll show an error page — more precisely, if your web application fails synchronously. But what about failures that happen asynchronously? Failures that happen during timeout handling (at exactly 2:13 AM)? Of course our LCEH makes sure all the gory details are in the event log or any other log file, ready to be collected by the administrator and analyzed by the next available developer. But what about the end user? What is the substitute for the error page in this case?

The end user certainly needs no error page that pops up when he starts the web application and tells him that (any) one workflow instance {144215AE-28EB-4c36-81F6-7A0F9EC69F76} failed last night. What he needs… well, what I expect if one of my vacation requests failed, is:

  • Some way of identifying the vacation request. In the long list of my vacation requests in the web application it should stand out, e.g. it could be in a certain error state marked red.
  • Some way of understanding what’s going on and what I am supposed to do. E.g. there could be an “info” column in the list of vacation requests, saying something like “There was an error writing the vacation to the payroll system. The administrator has been notified. Since he is on vacation and won’t be back in time you may just forget about your vacation.”
  • Some way of being notified of the problem; why else would I look into the web application in the first place? Read “email”, and it could tell me exactly the same things the web application does. (No reason to omit either!)

Easier said than done? Well, yes. The challenge here is that I expect a lot to happen under error conditions. What if the error occurred while sending an email? Send another one to tell me that emails cannot be sent?

The reality is that it does not exactly make sense to put that much effort into code that should by all means never run. The realistic approach would be:

  • Make vacation requests (i.e. any data that is subject to asynchronous processing) in temporary and error states clearly distinguishable in the web application. This might mean introducing yet another group of states, the “I’m going to do something and the user should never see the intermediate state because it’s overwritten after I have done my job properly”-states.
  • Make sure somebody gets notified. If there is a danger of losing data or producing inconsistent data, at least try to send an email to the administrator (a minimal sketch follows). Or shut down and make him look for the cause.
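
As a minimal sketch of the notification part (the mail host and addresses are made up; the important bit is that a failing notification must not obscure the original problem):

// Notify the administrator - a minimal sketch. Host name and addresses
// are made up; a failing notification falls back to the event log.
// (Namespaces: System.Net.Mail, System.Diagnostics)
public static class AdminNotification
{
    public static void NotifyAdministrator(Guid workflowId, Exception error)
    {
        try
        {
            new SmtpClient("mailhost").Send(
                "vacations@example.com", "admin@example.com",
                "Workflow " + workflowId + " failed",
                error.ToString());
        }
        catch (Exception mailError)
        {
            // e.g. Exchange is down: at least leave a trace in the event log
            // (or stop the service altogether to force someone to look)
            EventLog.WriteEntry("Vacations@SDX",
                "Could not notify the admin about workflow " + workflowId + ": " + mailError,
                EventLogEntryType.Error);
        }
    }
}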

To summarize this post in broader terms: It is futile to try to mend every error condition. Rather than trying to join Sisyphus, try to make errors apparent. Entering an error state is far better than insufficiently trying to handle (and effectively obscuring) it.

I didn’t plan that but I think it makes sense to spend the next post on the replay pattern…

PS: Finally a recommended post, quote “Programmers often have a misconception that their software should always work.”

That’s all for now folks,
AJ.NET


November 4, 2007

Robust Operations

Filed under: .NET, .NET Framework, Software Architecture, Software Development, WF — ajdotnet @ 4:26 pm

I guess you don’t want to know about my latest “Vista vs. Windows XP on one machine” experience. Suffice it to say that DOS and FDISK saved my day. And that it reminded me that the next post should be about robustness…

Note: This is part of a series:
1. Talking WF (introduction)
2. Workflow Instance State Management
3. Workflow Communication and Workflow Communication Clarification
4. Hosting WF
5. Talking to the Windows Service
more to come (error treatment).

Last week at the coffee machine…

David Developer: „Hey Mrs. Accounting. That vacation of mine wasn‘t properly included in my salary!“

Alica Accounting: „I know of no vacation. And I really don‘t have time for this. I have to track down Carl Consultant. He went on a vacation that actually was rejected!“

Peter Production (munching a donut): „Oh, fat muft all haff happened laft monf … (gulp) … excuse me, when I had to reboot a few servers. Probably Vacations@SDX tried to access the payroll system when it was down. I told you, didn’t I?”

Alica Accounting: „Yes?“ (like in „What the heck is this guy talking about?“)

Peter Production: „Well, the application should have sent an email about that failure, but Exchange was also down. (slowly walking out) But since Paul Projectmanager agreed, the vacation was listed as granted. (From the corridor) And Carl Consultant‘s vacation request somehow vanished. (Gone)“

Alica Accounting: „Who built such a mess?“

David Developer: „Oh, I … *aham* … I really don‘t know…“

People pissed off, money involved, considerable cleanup work required. Do you want to be David Developer in this case? Do you need more motivation to think about robust processing and error handling?

Robust processing (i.e. preventing errors) and error handling are actually two sides of the same coin. What’s needed to alleviate the issues implied in the above conversation is as follows:

Preventing Errors:

  • Calling another application semantically synchronously – and dealing with unavailability (the payroll system, as a synchronous acknowledgement is crucial for the workflow)
  • Calling another application semantically asynchronously (the email system, as email is inherently asynchronous)
  • Doing several things semantically in a transactional fashion – and breaking transaction contexts if expedient.

The reason I stressed the word semantically should become clear in a minute. Please note that we cannot prevent the erroneous behaviour (e.g. unavailability); but we can prevent it from becoming a showstopper for our application. Only if that effort fails will we have to deal with the error (which I’ll defer to the next post).

Calling Another Application Synchronously

Many things are done synchronously because it‘s the natural way to do it – and because it‘s so easy. Some things on the other hand have to be done synchronously despite the fact that it‘s far from easy. Talking to another application usually falls in the second category… .
When talking to another application you‘ll have to make sure the call actually gets through. If it does not, retry! (And with this your simple call becomes a loop.) If there is no chance, fail. But don‘t fail like „throw Exception and wreck the workflow instance“, fail like „execute this branch of the workflow that puts the workflow in a defined error state, to be mended by some operator and clearly visible within Vacations@SDX (nothing fancy, a small neon light and a bell will do)“. If you actually manage to get a call through, make sure it was processed correctly. And correctly does not only mean „no exception thrown“. It also includes „sent back the correct reply“. If you get some unexpected return value, one that you haven‘t been told about when you asked for the contract (and thus you cannot know whether it is safe to go on), again: fail!
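
Here is a sketch of that loop. IPayrollService, PayrollReply, and VacationRequest are made-up types; the structure (retry, validate the reply, fail into a defined error branch instead of throwing) is the point:

// Calling the payroll system "semantically synchronously" - a sketch.
// IPayrollService, PayrollReply, and VacationRequest are made-up types.
public bool TryWriteVacationToPayroll(IPayrollService payroll, VacationRequest request)
{
    const int maxAttempts = 3;
    for (int attempt = 1; attempt <= maxAttempts; attempt++)
    {
        try
        {
            PayrollReply reply = payroll.WriteVacation(request.Employee, request.From, request.To);

            // "no exception" is not enough - check the reply against the contract
            if (reply == PayrollReply.Accepted)
                return true;

            // unexpected reply: we cannot know whether it is safe to go on,
            // so fail (the caller routes the workflow into its error branch)
            return false;
        }
        catch (TimeoutException)
        {
            // the system may simply be offline for a moment - wait and retry
            System.Threading.Thread.Sleep(TimeSpan.FromSeconds(30 * attempt));
        }
    }
    // still no luck: fail into the defined error state, don't throw
    return false;
}

In a real workflow you would probably model the waiting with a delay activity rather than blocking the thread, but the principle stays the same.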

The crux is: Failing to call some other system, a system that may be offline, is no technical issue. It‘s something to be expected and it has to be handled accordingly. Also, applications evolve (and nobody will bother to tell you, just because you happen to call that WebService), and again, this is to be expected!

Internal WebServices may not need this kind of caution, but I would apply this rule to any external application, WebService, or other API.

Calling Another Application Asynchronously

Sending an email is an inherently asynchronous operation. If all goes well you talk to the mail gateway and that‘s that. No way to know whether and when the mail is received. So why even bother whether the gateway accepted the mail right now, as long as it will eventually accept it?

Suppose you just put the email in a certain database table and go on as if you did everything you had to do. No special error handling, no external dependency. Nice and easy.
Suppose there is a Mail Service (e.g. a Windows Service that you just wrote) that regularly picks up mails from said table and tries to send them, one at a time. If the mail gateway was not available it would retry. If there was an error it would notify the admin.

„But how does my application know whether the Mail Service actually did send the email? Don‘t I need some kind of feedback?“

What for? Email is unsafe; even if the gateway accepted it, it might still not reach the recipient. So why bother whether something went wrong at the initial step?
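
A minimal sketch of the first half of that idea; the MailQueue table and its columns are made up, and the workflow merely inserts a row and moves on while the Mail Service does the actual sending and retrying:

// "Sending" an email from the workflow - a minimal sketch.
// The MailQueue table and its columns are made up.
// (Namespace: System.Data.SqlClient)
public static class MailQueue
{
    public static void Enqueue(string connectionString, string to, string subject, string body)
    {
        using (SqlConnection connection = new SqlConnection(connectionString))
        using (SqlCommand command = new SqlCommand(
            "INSERT INTO MailQueue ([To], [Subject], [Body], [CreatedAt]) " +
            "VALUES (@to, @subject, @body, GETUTCDATE())", connection))
        {
            command.Parameters.AddWithValue("@to", to);
            command.Parameters.AddWithValue("@subject", subject);
            command.Parameters.AddWithValue("@body", body);
            connection.Open();
            command.ExecuteNonQuery();
        }
        // that's it - no gateway, no retry logic, no error handling needed here
    }
}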

Side note: This Mail Service would be „the last piece of mail access code you‘d ever have to write“, and it would add substantial robustness to sending emails. Advantages include:

  • asynchronous processing and load levelling (even in the case of mass emails)
  • application independence (if you happen to have more than one application in need of sending emails)
  • just one mail gateway account, infrastructure hassle just once.
  • the option of adding additional features (routing incoming reply emails to the respective application, regularly resending unanswered emails, escalation emails, …)

The crux here is: If something already is asynchronous, don’t bother making the call to the system foolproof. Factor it out and let some dedicated service handle the gory details of setting up the connection, retry scenarios, escalation, etc. This pattern also holds for other use cases: a file drop to some network drive, sending an SMS, accessing a printer.

Doing Several Things in a Transactional Fashion

Transactions are nothing new. In code you use some kind of begin transaction/end transaction logic, in WF there is a respective shape to span a transaction boundary (TransactionScopeActivity). Use it! A transaction scope activity not only spans the contained logic, it also adds a persistence point (see Introduction to Hosting Windows Workflow Foundation).

Enter „informal“ transactions…

Suppose neither the payroll system nor the email gateway support transactions. The textbook answer for non-transactional systems would be „compensation“. But what if the payroll system needs special permission or treatment to undo an entry? And how do you fetch back an email?

In my experience compensation at this technical level is largely a myth. If anything, compensation is a complicated business process that takes its own workflow to handle.

Obviously the transaction boundaries we assumed do not quite work in these cases. But rather than trying endlessly to make them work, we might take a step back and rethink the state model. Do we have to go directly from „waiting for approval“ to „finished“? „Waiting“ and „finished“ are business states, states that describe interaction with or feedback for some end user. But nobody denies us the option to introduce intermediary states like „granting vacation (processing payroll update)“ and „granting vacation (sending notification emails)“. This would effectively cut the former transactional scope into pieces and thus eliminate the immediate need for ACID transactions or compensation.

What do we gain from this? Should anything happen along the way, the state would remain in one of the intermediate states. It would not enter the „vacation granted“ state and effectively lie to us. And what‘s more, the state would actually tell us which part of the process failed, the payroll system or the email. And it would be obvious, clearly standing out in the list of open vacation requests.
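
In code this can be as simple as extending the state enumeration (all names are made up, of course):

// Business states plus intermediate states - a sketch; all names are made up.
public enum VacationRequestState
{
    // business states, visible to the end user
    WaitingForApproval,
    Granted,
    Rejected,

    // intermediate states, normally overwritten within seconds -
    // if one of them sticks, we know exactly which fragile step failed
    GrantingUpdatingPayroll,        // "granting vacation (processing payroll update)"
    GrantingSendingNotifications,   // "granting vacation (sending notification emails)"

    // defined error state (error handling is the topic of the next post)
    Error
}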

The crux here is: Don‘t try to mirror business transactions to technical transaction boundaries. Don‘t do several fragile things in one step, separate them.

Feeling better? More confident? Sorry to disappoint you, but no matter how hard you try, eventually there will be errors. And how to address them will be covered in the next post.

That’s all for now folks,
AJ.NET


October 14, 2007

Hosting WF

Filed under: .NET, .NET Framework, C#, Software Architecture, Software Development, WF — ajdotnet @ 2:37 pm

Note: This is part of a series:
1. Talking WF (introduction)
2. Workflow Instance State Management
3. Workflow Communication and Workflow Communication Clarification
more to come (error management).

Hosting WF is easy. Really easy. You can host it anywhere you want. Console applications, ASP.NET applications, Windows Services, shopping bags, closets, … . Well, it turns out that most available samples are too simplistic for my taste. Hosting for the sake of it may be easy; hosting in real world scenarios has its pitfalls.

Hosting in IIS

Given that our example application is a web application, the first attempt at hosting WF might be somewhere within the web application itself. Actually this is quite simple and covers all but one demand. Due to the IIS architecture it is even a very robust solution, one that can easily be adorned with an IIS-hosted WCF service interface, again backed by IIS features, namely security. Just great. Apart from the „but one demand“: IIS-hosted code relies on a request! Not a “current” request, just any request that keeps the appdomain running. Should the appdomain shut down and go to sleep for the rest of the night, there is no running WF engine. What if a workflow is supposed to time out sometime during the night? What if it is supposed to regularly poll a directory or database table? None of this happens until the next request comes in, wakes the appdomain, and eventually starts the WF engine. If you can live with the time lag, go for IIS. If not, IIS is out of the question.

And of course our example application cannot live with that time lag. There regularly won‘t be a vacation request for days, sometimes weeks. No timeouts and reminder mails? No way!

Hosting in a Windows Service

I‘ve seen various developers at this point thinking of alternatives: Regularly trigger the web application to revive the appdomain… Hosting the WF in a console application… A winforms application… . Eventually I realized that quite a few developers just try to avoid the obvious solution: Windows Services (as in NT Service).

Why avoid them? Because Windows Services are weird. And demanding. They have to be installed and started via the service control manager (SCM). You have to deal with service accounts and permissions. Windows Services are supposed to be multithreaded. They have no decent UI other than some logs. They have to be robust—and if you think your current application is robust, a Windows Service has to be robust^3. For example:

  • A Windows Service may start at boot time and shut down 3 years later.
  • A Windows Service may have to talk to a database (or other resource) that occasionally goes offline for maintenance reasons.
  • A Windows Service may become what I call a zombie if the worker thread terminates but the service keeps running.

Cheer up, with .NET everything gets better. The .NET Framework devotes a whole namespace to Windows Services: System.ServiceProcess. The most important class is ServiceBase which acts as base class for your own Windows Service implementation; actually it’s a thin layer over the Service Control Handler Function. There are also classes for installation or controlling other services. So there already is some very welcome support available.

Additionally there are quite a few examples available on how to use these classes to host WF, usually in conjunction with WCF (if the WF engine is not hosted in your web application you need some means to talk to it; we’ll get to that, too). The most simplistic implementation would start and stop the WF engine in the respective SCM commands (i.e. ServiceBase.OnStart and ServiceBase.OnStop respectively). It really doesn’t take more than that to implement a valid Windows Service hosting WF. Well, for the sake of it that may be true. But real world demands? Are these examples actually production ready? Do they fulfill the demands regarding operations, robustness, etc.? Not in my opinion.
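
For reference, such a most simplistic implementation would look roughly like this (persistence service and connection string only hinted at):

// The simplistic version: start/stop the WF engine in the SCM commands.
// Enough to host WF, but not what I would call production ready.
// (Namespaces: System.ServiceProcess, System.Workflow.Runtime, System.Workflow.Runtime.Hosting)
public class WorkflowHostService : ServiceBase
{
    private WorkflowRuntime _runtime;

    public WorkflowHostService()
    {
        ServiceName = "VacationsWorkflowHost";
    }

    protected override void OnStart(string[] args)
    {
        _runtime = new WorkflowRuntime();
        // persistence so instances survive a shutdown (connection string omitted)
        _runtime.AddService(new SqlWorkflowPersistenceService("connection string..."));
        _runtime.StartRuntime();
    }

    protected override void OnStop()
    {
        _runtime.StopRuntime();
        _runtime.Dispose();
    }

    public static void Main()
    {
        ServiceBase.Run(new WorkflowHostService());
    }
}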

Better Hosting in a Windows Service

ServiceBase has no special support for worker threads. No support for any runtime diagnosis (such as a ping or heartbeat). ServiceBase is actually prone to becoming a zombie, because an exception during any SCM command (such as pause) will be written to the event log but it will _not_ stop the service. Yet this can hardly be called „robustness“ (even if it were intended that way), because an exception in another thread will bring down the whole process, including all services contained in the same EXE and registered with the SCM.
Actually these are all fairly generic issues for any Windows Service and not dependent on the work the service actually does. Therefore it’s quite easy to come up with a reusable Windows Service framework on top of ServiceBase.

And here is the pattern to be implemented by such a framework (based on two real world projects):

  • Restrict your EXE to one Windows Service. We want the service to go down if something bad happens; and dragging innocent services into death just because they happen to run in the same process won’t do.
  • Don‘t do your work in ServiceBase.OnStart. Use it to start a separate thread, acting as watchdog. Notify that thread about pause, continue, stop, and the other SCM commands. If an exception is raised during an SCM command, again, kill the service (you can always do that by starting another thread that raises an exception).
  • The watchdog thread should start the engine and afterwards enter a loop. It should leave the loop if it receives the stop request, stopping the engine before it finishes.
  • Within the loop the watchdog thread should regularly check whether the engine is still operating (hence the term “watchdog”). If the engine fails for some reason the watchdog thread may try to compensate (restart the WF engine). E.g. if the engine failed because the databases went off line, it may be an option to wait for a certain amount of time. If the database server was just rebooted the databases may come online in a few minutes. Anyway, if that is not possible, kill the service.
  • The watchdog should also maintain a heartbeat, say trigger a performance counter that tells the operator the Windows Service is healthy — even if the system doesn‘t do anything worthwhile right now.

The reason for killing the service is simple: A Windows Service that stopped working is more obvious and will be noted far earlier than one that just wrote an event log entry and kept lingering around. Also a stopped service usually can be tracked by operations software such as NetView.
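
Here is a sketch of the watchdog core as a member of the Windows Service class. OnStart merely starts this thread and OnStop signals _stopRequest; _runtime, _heartbeatCounter (a PerformanceCounter), EngineIsHealthy, and TryRestartEngine are assumed members:

// The watchdog loop - a sketch. _runtime, _heartbeatCounter, EngineIsHealthy,
// and TryRestartEngine are assumed members of the service class.
private readonly ManualResetEvent _stopRequest = new ManualResetEvent(false);

private void WatchdogLoop()
{
    try
    {
        _runtime.StartRuntime();

        // loop until OnStop signals the stop request; wake up every 30 seconds
        while (!_stopRequest.WaitOne(TimeSpan.FromSeconds(30), false))
        {
            _heartbeatCounter.Increment(); // "I am alive", even if there is nothing to do

            if (!EngineIsHealthy() && !TryRestartEngine())
            {
                // e.g. database still offline and waiting did not help:
                // better a stopped service that gets noticed than a zombie
                Stop();
                return;
            }
        }

        _runtime.StopRuntime();
    }
    catch (Exception ex)
    {
        EventLog.WriteEntry("Watchdog failed: " + ex, EventLogEntryType.Error);
        Stop();
    }
}

OnStart returning immediately also keeps the SCM happy regarding its startup timeout.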

Now we have a service that fulfills basic demands of robustness. There is more to say about robustness, but before that we need a means to talk to the workflow, now that we cut it out of the web application. Next post.

That’s all for now folks,
AJ.NET


October 6, 2007

Workflow Communication Clarification

Filed under: .NET, .NET Framework, ASP.NET, C#, Software Architecture, Software Development, WF — ajdotnet @ 2:47 pm

My last post may need a clarification…

Tim raised the question whether it is such a good idea to synchronize the communication between web application and workflow instance. To make it short: Generally it isn’t! Period. It hurts scalability and it may block the web request being processed. Therefore, from a technical perspective it usually is preferable to avoid it. Ways around this include:

  • Redirect the user to a different page telling him that his request has been placed and that it may take a while until it is processed.
  • Ignore the fact that the current page may show an outdated workflow state, hope for the best, and take measures so that another superfluous request won’t cause any problems.
  • ….

However sometimes you cannot avoid having to support a use case like the one I presented. But even then — if the circumstances demand it — it might be possible to avoid blocking on the workflow communication. E.g. the web application could write a “processing requested…” state into the workflow instance state table (contrary to what I wrote in an earlier post, but that was just one pattern, not a rule either). The actual call could be made asynchronously, even be delayed somehow.

If you still think about following the pattern I laid out, make sure the processing within the workflow instance is fast enough to be feasibly handled synchronously. E.g.:

[Sequence diagram: the synchronous call spans the entire processing]

As you can see, synchronization encapsulates the whole processing.

What if the processing takes longer? Well, the pattern still holds if you introduce an “I’m currently processing…” state. The callback won’t tell you the work has been done any longer, but it will still tell you that the demand has been placed and accepted by the workflow instance.

[Sequence diagram: the synchronous call spans only the acceptance of the demand]

In this case synchronization encapsulates only the demand being accepted, not the processing itself. However it still serves the original purpose, which was telling the user that his request has been placed.

What if you need to synchronize with the end of a longer running task? In that case this pattern is not the way to go. The user clicking on a button and the http post not returning for a minute? This definitely calls for another pattern.

I hope this has made things clearer.

That’s all for now folks,
AJ.NET


October 2, 2007

Workflow Communication

Filed under: .NET, .NET Framework, C#, Software Architecture, Software Development, WF — ajdotnet @ 8:36 pm

Note: This is part of a series:
1. Talking WF (introduction)
2. Workflow Instance State Management
more to come (hosting, error management).

The last post brought up the following topic:

“I clicked a button and somehow the workflow took over and did something. As you know, this happens asynchronously, nevertheless I saw the correct state after the processing instantly. To make that work, the web application has to have a means to know when the state has changed. Workflow communication.”

So, how do you talk to a workflow instance? How do you know it processed your request? How does it talk back?

Please note: Talking to a workflow is a topic that is actually well documented. Yet it still has its pitfalls and it takes a little getting used to. Therefore I’ll recap the pattern, trying to explain it along the way. And I will assume a certain use case (the synchronization issue) because it showed up regularly in our projects.

If you are new to WF you’ll have to master two or three things to talk to a workflow: The service pattern, a silly interface pattern, and some threading issues.

The service pattern

If you read an article about WF, the service pattern is most likely not introduced properly; most articles simply say “do this, and you’ll accomplish that”. The service pattern has been widely used for Visual Studio design time support, but that’s the only usage I am aware of. Until WinFX came around, that is. Firstly, WF uses the service pattern for runtime services such as persistence. In this case the developer only has to announce existing service implementations. Secondly, it also employs services for data exchange — in which case the developer has to deal with the service pattern in all its beauty.

For the record: The service pattern usually defines a service as an interface. It decouples the consumer of a service and the provider. A certain service may be optional (such as the tracking service) or mandatory (such as the scheduler service). If the consumer needs a particular service, it’ll ask the provider. Usually providers form a chain, so if the first provider does not know the service, it’ll ask the next one.

And no, it’s got nothing to do with web services…

You might want to read Create And Host Custom Designers With The .NET Framework 2.0 Chapter “Extensibility with Services” for a slightly more elaborate description.

For data exchange the developer has to define an interface (the service contract) and a class that implements the interface (the service). During startup he creates an instance of the service and registers it with the data exchange service (a service provider in itself). If you need it in an activity to do something worthwhile you ask the ActivityExecutionContext (the service provider at hand, see Activity.Execute) for the service. Then you work with it.
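
Both sides look roughly as follows, using the IVactionRequestWorkflow interface and the VacationRequestWorkflowService class shown further down in this post; the custom activity is only an illustration, the predefined data exchange activities do essentially the same:

// Host side: announce the data exchange service and your own service at startup
// (e.g. right after creating the WorkflowRuntime).
ExternalDataExchangeService dataExchange = new ExternalDataExchangeService();
workflowRuntime.AddService(dataExchange);
dataExchange.AddService(new VacationRequestWorkflowService());

// Workflow side: ask the ActivityExecutionContext for the service within an activity.
public class NotifyHostActivity : Activity
{
    protected override ActivityExecutionStatus Execute(ActivityExecutionContext executionContext)
    {
        IVactionRequestWorkflow service = executionContext.GetService<IVactionRequestWorkflow>();
        service.ConfirmationProcessed(true); // callback into the host
        return ActivityExecutionStatus.Closed;
    }
}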

Some activities already provide design time support for services and the activities for data exchange are among them (HandleExternalEventActivity and CallExternalMethodActivity):

[Screenshots: activity designers and property grids for HandleExternalEventActivity and CallExternalMethodActivity]

Please note that many of the predefined activities offer other anchors to attach your own code, events and the like, that do not provide easy access to the ActivityExecutionContext, thus no service provider is readily available.

A silly interface pattern

When I first wanted to talk to a workflow, I knew I needed an interface. To call into the workflow I expected a method; to get notified from the workflow I expected an event. Of course I expected to call and be called on my own thread. Guess what… .

Well, my attitude actually was “I – the code of the host application – create the workflow, I’m in charge, I define the point of view.” The reality of course is “I – the workflow – do not care who the host application code thinks he is. I do the work, I’m in charge, I define the point of view.” This turns my initial expectations upside down. In other words: A call into the workflow instance notifies the instance of some external demand, thus it is made via an event. If the workflow instance has something to say, it simply calls a method. And it does not care what thread is involved.

[ExternalDataExchange]
interface IVactionRequestWorkflow
{
    // call into workflow instance
    event EventHandler<ConfirmationEventArgs> ConfirmVacationRequest;
    
    // callback from workflow instance
    void ConfirmationProcessed(bool confirmed);
}

Silly, isn’t it?

Some threading issues

To elaborate more on the threading issue: We know that WF is inherently multithreaded. When you call into a workflow instance (pardon, when you signal it), the WF engine (more precisely the data exchange service) will catch the event on your thread, do some context switch magic, and signal the workflow instance on its thread. (It might have to load the workflow instance, which is just another reason for the decoupling.) If the workflow instance on the other hand has some information it calls your code on its own thread — the WF engine can hardly hijack your thread for some automatic context switch. Consequence: You signal the workflow instance and the call (more or less) immediately returns. But you don’t know when the workflow will actually get signaled, much less when it will have done the respective processing. You may issue a call from the workflow instance but you will have to care for the context switch and the proper reaction yourself.

This is pretty good for performance and scalability. But if you need synchronous behavior (as in our example) it hurts a bit.

To make a long story short, here’s the pattern we came up with:

  • The interface contains a CallIntoWorkflowInstance event and a respective CallIntoWorkflowInstanceAcknowledge callback method (ConfirmVacationRequest and ConfirmationProcessed in the interface above)
  • The workflow instance accepts the CallIntoWorkflowInstance event, does some work (supposed to be synchronous) and calls CallIntoWorkflowInstanceAcknowledge to tell you it finished its part.
  • The service implements the event (along with a SignalCallIntoWorkflowInstance helper method) and the method. It also has an AutoResetEvent for the synchronization.
  • The helper method (called from the web application thread) raises the event and afterwards waits for the AutoResetEvent.
  • The callback method (called from the workflow instance thread) simply signals the AutoResetEvent. This will cause the helper method on the other thread to be unblocked and to return.

The service implementing the above interface consequently looks like this:

class VacationRequestWorkflowService : IVactionRequestWorkflow
{
    AutoResetEvent _confirmationProcessed = new AutoResetEvent(false);
    
    public event EventHandler<ConfirmationEventArgs> ConfirmVacationRequest;
    
    public void OnConfirmVacationRequest(ConfirmationEventArgs ea)
    {
        if (ConfirmVacationRequest != null)
            ConfirmVacationRequest(null, ea);
    }
    
    public void ConfirmationProcessed(bool confirmed)
    {
        _confirmationProcessed.Set();
    }

    public void SynchronousConfirmVacationRequest(WorkflowInstance instance, bool confirmed)
    {
        // send event
        OnConfirmVacationRequest(new ConfirmationEventArgs(instance.InstanceId, confirmed));
        // wait for callback
        _confirmationProcessed.WaitOne();
    }
}

Now we have synchronized the workflow instance with our calling application. Of course we rely on the callback or we would wait endlessly, but that’s another story.
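
For completeness, a sketch of the calling side in the web application. _runtime and _workflowService are assumed to be the WorkflowRuntime and the registered service instance, and BindVacationRequestList is a made-up helper; in practice you would also use a WaitOne overload with a timeout rather than waiting forever:

// Web application side - a sketch. _runtime, _workflowService, and
// BindVacationRequestList are assumptions of this example.
protected void OnConfirmClick(object sender, EventArgs e)
{
    Guid workflowId = (Guid)ViewState["WorkflowInstanceId"];
    WorkflowInstance instance = _runtime.GetWorkflow(workflowId);

    // raises the event into the workflow instance and blocks until the
    // ConfirmationProcessed callback signals the AutoResetEvent
    _workflowService.SynchronousConfirmVacationRequest(instance, true);

    // at this point the workflow has done its (synchronous) part,
    // so the page can safely show the updated state
    BindVacationRequestList();
}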

The next post will take a closer look at hosting.

That’s all for now folks,
AJ.NET


September 30, 2007

Workflow Instance State Management

Filed under: .NET, .NET Framework, C#, Software Architecture, Software Development, WF — ajdotnet @ 4:34 pm

Note: This is part of a series:
1. Talking WF (introduction)
more to come (communication, hosting, error management).

Let’s skip the “how do I reference the necessary assemblies” stuff and assume the boilerplate work of setting up the WF engine and starting workflows has been done. (There’s plenty of documentation about this available elsewhere.) So let’s assume we have a page that starts workflow instances and the workflow is in place and running. Now, when you start thinking about workflows you‘ll sooner or … well, even sooner, ask questions like „How does this piece of information get to my workflow instance?“ and „How does that piece get out?“. In short: data exchange.

Let‘s make that more tangible with an example: Let’s assume I‘m the project manager and I just got a notice that the guy across the room wants to go on vacation right when I planned the go-live. Damn his insolence! How do I cancel his request?
I start up Vacations@SDX, select the page that gives me a list of open vacation requests, and click the „cancel vacation“ button for the one in question. The page refreshes and the state of the vacation request becomes „canceled for undue insolence“ (Oh boy, if any of that happened, my colleagues would kill me. But nastiness is so much fun 👿 ).

There are two things that happened here that are noteworthy:

  1. I saw a list of open vacation requests. And each request should be represented by a running workflow instance. Did I see a list of workflow instances? Workflow instance state management.
  2. I clicked a button and somehow the workflow took over and did something. As you know, this happens asynchronously, nevertheless I saw the correct state after the processing instantly. To make that work, the web application has to have a means to know when the state has changed. Workflow communication.

Workflow instance state management

The workflow instance has state. It knows about data somehow coming in, knows what activities have been executed and so forth. Therefore it would be possible to query all workflows, load them, and ask about certain details, e.g. read some properties. Wait, the workflow instance might process something asynchronously… . How do you synchronize? What if the workflow just terminated because you loaded it and it somehow finished? Do you really want to load thousands of workflow instances for every query? (OK, that‘s probably beyond the example application, but still…).

Once you‘ve thought along that line you‘ll realize that you need to maintain the business state (the vacation request in our example) outside of the workflow instance, i.e. in a database table. But now you have to face the problem of keeping the data in this table and the related workflow instance in sync. Here you have two feasible options:

  1. Keep the workflow instance oblivious of that table, let the web application read and write into the table (kind of a cache).
  2. Let the workflow maintain its state in the table directly, let your web application read the table.

Some people might vote for option 1, and with good reason. It keeps the business logic in one place, the web application; you don‘t have to deal with technically concurrent database access in semantically single „transactions“. It clearly limits the workflow instance to workflow stuff, which promotes separation of concerns.

All well and good. But I go for number 2. For two reasons: Number 1 would imply more communication and synchronization between workflow instance and web application (we‘ll look into that). And the main reason: If my workflow is long running, it might change its state (say to „timeout because no one cared“) at a time when no user is online and the web application is fast asleep. With number 1 some other piece of code would have to do what the web application already does.

The pattern we came to appreciate goes like this: 

  • Mirror the workflow state in a table (containing the state and the workflow instance ID)
  • Let the workflow maintain that state
  • Let the web application query that state
  • Provide cleanup, e.g. remove the respective row when the workflow finishes.
  • Ensure consistency (e.g. put in validation code that verifies that the information within the table is consistent with the workflow state, provide some diagnostic tool that checks whether the workflow instances still exists)

This way it should be far easier to provide overview lists. And to interact with workflow instances from your web application you would need a table with workflow IDs anyway.
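
As a sketch, the “let the workflow maintain that state” part can be a tiny custom activity that updates the table, keyed by the workflow instance ID (WorkflowStateTable is a made-up helper around that table):

// Letting the workflow maintain its state row - a minimal sketch.
// WorkflowStateTable and the state names are made up.
public class SetRequestStateActivity : Activity
{
    private string _state;

    public string State
    {
        get { return _state; }
        set { _state = value; }
    }

    protected override ActivityExecutionStatus Execute(ActivityExecutionContext executionContext)
    {
        // WorkflowInstanceId is provided by the Activity base class
        WorkflowStateTable.SetState(this.WorkflowInstanceId, _state);
        return ActivityExecutionStatus.Closed;
    }
}

Dropping such an activity behind every relevant state change in the workflow keeps the table in sync; the cleanup and consistency checks from the list above complete the picture.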

The next post will look into the workflow communication topic, the actual data exchange with the workflow instance.

That’s all for now folks,
AJ.NET


September 22, 2007

Talking WF…

Filed under: .NET, .NET Framework, ASP.NET, C#, Software Architecture, Software Development, WF — ajdotnet @ 5:11 pm

WF (for the record: Windows Workflow Foundation) is great. If you need it, you really need it. If you have the engine and one workflow running, adding other workflows is no sweat at all. If your workflows are persisted, you won‘t have to care for shutdowns. If you carefully designed your workflow you can use the graphical view for discussions with the user. If you set it up correctly, even errors can be handled by workflows.

Well, WF is great. But there certainly are a lot of “if”s here…

I was at the PDC 05 where Microsoft introduced the WWF (later to be renamed to WF due to some concerned Pandas 😉 ) as THE big surprise. But I had to wait until earlier this year to get my “production hands” on it. Curiously that happened more or less simultaneously in more than one project. And curiously, despite the different nature of the tasks at hand, the projects turned out to be very similar. Similar in terms of the problems, similar in terms of the pitfalls, similar in terms of the solutions. And also very similar in the demand for knowledge of things that are not WF specific but are far from common knowledge for the average developer… (and with that I mean the average knowledge areas, not developers of average knowledge… did that make sense? I have no intention to disqualify someone for not knowing all the stuff that‘s following!)

Curiously^3 while there is so much information available about WF—hosting, data exchange, and what else—it still did not answer our questions. I rather got the impression that my problems were largely ignored. Or if they were mentioned it stopped right there. “Here is something you will have to think about!” Period. “You should use XY.” Period. “That will be a problem.” Period.

So I had to rely on my BizTalk background to fill the gaps (and my knowledge about multithreading even goes back to OS/2, sic! Am I in danger of becoming a dinosaur?). In the following posts I will dig into one topic or the other and some of the “best practices” (if they can be called that) we came up with, which might help the next developer.

But for now let‘s just lay out the example application:

At SDX people work for various customers, usually on-site. If I would like to go on vacation I might have something to say about the matter myself. So does the customer. So does my account manager. And the back office. And… you get the idea. To manage that you need an informal workflow at least; once the head count of your company goes up you need it formalized. And this is what our example application (let‘s call it Vacations@SDX) does:

  • Show me how many vacation days I have left.
  • Let me request a new vacation.
  • Let all people concerned with my vacation have their say.
  • If everyone involved acknowledged the vacation, write the information to the payroll application.

Of course since most people are mostly out of office, the application has to be a web application, the workflow long running, and notifications done by email. And the payroll application could be offline. And most importantly: DON‘T MESS WITH MY VACATION! Losing my request somewhere along the way simply WON‘T DO!

Nothing special here at all. Some kind of document review, order processing, or whatever workflow would have similar demands. Actually this is a very simple real world example. And yet, if done correctly it is a far from simple undertaking. Certainly it is not as simple as drawing a workflow and hooking up the basics as described in the available documentation.

On the list of things to talk about is workflow instance state, workflow communication, hosting, and error handling. This will be fairly basic stuff from an “architectural” point of view, but I will assume that you know or read up about the WF basics. I‘m not going to repeat the dozen lines of code it takes to host WF in a console application.

If that sounds interesting for you, stay tuned.

That’s all for now folks,
AJ.NET
