Entries Tagged 'bpm' ↓

The Power of Later

Steve Vinoski nominates RPC as an historically bad idea, yet the synchronous request-reply message pattern is undeniably the most common pattern out there in our client-server world. Steve puts this down to “convenience”, but I think it goes deeper than that, because RPC is actually not convenient – it causes an awful lot of problems.

In addition to the usual problems cited by Steve, RPC makes heavy demands on scalability. Consider the SLA requirements for an RPC service provider. An RPC request must respond in a reasonable time interval – typically a few tens of seconds at most. But what happens when the service is under heavy load? What strategies can we use to ensure a reasonable level of service availability?

  1. Keep adding capacity to the service so as to maintain the required responsiveness…damn the torpedoes and the budget!
  2. The client times out, leaving the request in an indeterminate state. In the worst case, the service may continue working, only to return and find the client has given up. Under continued assault, the availability of both the client and the server continues to degrade.
  3. The service stops accepting requests beyond a given threshold. Clients that have already submitted a request are responded to within the SLA. Later clients are out of luck until the traffic drops back to manageable levels; they will need to resubmit later (if the request is still relevant).
  4. The client submits the request to a proxy (such as a message queue) and then carries on with other work. The service responds when it can and hopefully the response is still relevant at that point.
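
To make the contrast with Option 2 concrete, here is a minimal sketch of Option 4 in Python. An in-process queue.Queue stands in for a real (and ideally persistent) message queue, and the service is artificially slowed to simulate load – the names and numbers are purely illustrative, not a definitive implementation.

```python
import queue
import threading
import time

# Stand-in for a durable message queue (Option 4). In a real system this
# would be a persistent broker rather than an in-process queue.
requests = queue.Queue()
responses = queue.Queue()

def service_worker():
    """The service drains the queue at whatever rate it can sustain."""
    while True:
        req = requests.get()
        time.sleep(0.5)                      # simulate a service under heavy load
        responses.put({"id": req["id"], "status": "done"})
        requests.task_done()

def submit(order_id):
    """The client submits the request and immediately carries on (Option 4),
    instead of blocking and timing out (Option 2)."""
    requests.put({"id": order_id, "payload": "order details"})
    print(f"order {order_id} submitted, carrying on with other work")

threading.Thread(target=service_worker, daemon=True).start()

for i in range(3):
    submit(i)

# Later - when the service has got around to it - the responses are collected.
requests.join()
while not responses.empty():
    print("response received:", responses.get())
```

The point is simply that the client’s availability no longer depends on the service answering within a tight window; the response arrives when the service can produce it.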

Out of all these coping strategies, Option 2 seems to be the most common, even though in many cases one of the others would be more efficient or cost-effective. Option 1 might be preferred given unlimited funds (and ignoring the fact that it is often technically infeasible). In the real world, Option 2 usually becomes the default.

The best choice depends on what the service consumer represents and what the side-effects cost when the service fails to meet its SLA.

When the client is a human – say, ordering something at our web site:

  • Option 2 means that we get a pissed off user. That may represent a high, medium or low cost to the organization, depending on the value of that user. In addition there is the cost of indeterminate requests: what if a request was executed after the client timed out? There may be a cost of cleaning up or reversing those requests.
  • Option 3 means that we also get a pissed off user – with the associated costs. We may lose a lot of potential customers who visit us during the “outage”. On the positive side, we minimise the risk/cost of indeterminate outcomes.
  • Option 4 is often acceptable to users – they know we have received their request and are happy to wait for a notification in the future. But there are some situations where immediate gratification is paramount.

On the other hand, if the client is a system participating in a long-running BPM flow, then we have a different cost/benefit equation.

  • For Option 2, we don’t have a “pissed off” user. But the transaction times out into an “error bucket” and is left in an indeterminate state. We must spend time and effort (usually costly human effort) to determine where that request got to and remediate that particular process. This can be very costly.
  • Option 3 once again has no user impact, and we minimise the risk of indeterminate requests. But what happens to the halted processes? Either they error out and must be restarted – which is expensive – or they must be queued up in some way, in which case Option 3 becomes equivalent to Option 4.
  • In the BPM scenario, Option 4 represents the smoothest path. Requests are queued up and acted upon when the service can get to them. All we need is patience and the process will eventually complete without the need for unusual process rollbacks or error handling. If the queue is persistent then we can even handle a complete outage and restoration of the service.

So if I am a service designer planning to handle service capacity constraints, then for human clients I would probably choose (in order) Options 3 and 4, and weigh the costs of Option 2. For BPM processes, where the clients are “machines”, I would prefer Option 4 every time. Why make work for myself handling timeouts?

One problem I see so often is that solution designers go for Option 2 by default – the worst of all the options available to them.

Progressive Data Constraints

As a follow-up to my last post, Richard Veryard described the concept of Post Before Processing, whereby rules are applied once you have safely captured the initial information. This is a good way of managing unstructured or semi-structured data, and it reminded me of other cases where data is progressively constrained and/or enriched during processing.

Content management systems use this technique at the unstructured end of the spectrum: source and draft content is pulled into the system from the “jungle” and then progressively edited, enriched, reviewed and so on until the content is published. Programmers will be familiar with this in the way that source code is progressively authored, compiled, unit tested and integrated via source code control (plus a bunch of QA rules and processes).

Many computer-aided design applications also provide this ability to impose different rules through the lifecycle of a process. Many years ago I worked on a CAD application for outside plant management for telcos which had a very interesting and powerful long-transaction facility. Normal mode, representing the current state of the network, enforced an array of data integrity and business rules – such as which cables could be connected to each other, and via what type of openable joint.

In design mode, different rules were in place so that model items could be edited and temporarily placed into invalid states. This was necessary because of the connected nature of the model (a telco network) and the fact that the design environment reflected the future state of the network, which might not correctly “interface” to the current network state. The general mode of operation was multiple design projects, created by different network designers, who managed the planning and evolution of the network from its current state to some future state. Multiple future states could potentially exist within the system at any point in time. Design projects followed a lifecycle from draft, through proposed, and into “as built”. This progression was accompanied by various rules governing data visibility, completeness and consistency.

This is a useful model for managing data which may be complex, distributed, or collaboratively obtained and maintained: in effect, you build a process around a long transaction which manages the transition of data between states of completion or consistency.
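
As a rough sketch of what lifecycle-dependent rules can look like in code (the states mirror the draft/proposed/as-built progression above, but the individual rules and field names are invented for illustration):

```python
# Different validation rules apply depending on where a design sits in its
# lifecycle: draft -> proposed -> as_built.

def validate_draft(design):
    # A draft may be incomplete and temporarily inconsistent;
    # only minimal identity checks apply.
    return [] if design.get("name") else ["design has no name"]

def validate_proposed(design):
    # A proposed design must be internally complete, but may still
    # conflict with the current state of the network.
    errors = validate_draft(design)
    if not design.get("connections"):
        errors.append("no connections defined")
    return errors

def validate_as_built(design):
    # Once marked as built, the full set of integrity rules applies,
    # including reconciliation with the live network model.
    errors = validate_proposed(design)
    if not design.get("reconciled_with_network"):
        errors.append("not reconciled with current network state")
    return errors

RULES = {"draft": validate_draft,
         "proposed": validate_proposed,
         "as_built": validate_as_built}

def promote(design, next_state):
    """Move a design to the next lifecycle state only if it satisfies
    the (stricter) rules of that state."""
    errors = RULES[next_state](design)
    if errors:
        raise ValueError(f"cannot promote to {next_state}: {errors}")
    design["state"] = next_state

design = {"state": "draft", "name": "North exchange upgrade"}
design["connections"] = ["cable-101", "joint-7"]
promote(design, "proposed")          # passes the proposed-state rules
# promote(design, "as_built")        # would fail: not yet reconciled
```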

Master Data Management is another topical example of this type of pattern. In this scenario data is distributed across systems or organizations. Local changes trigger processes which manage the dissemination of changes to bring the various parts of the data into consistency. During this process different business rules may be applied to manage the data in transition.
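
A small, hypothetical sketch of that dissemination step – the system names, fields and the single “in transition” flag are made up – shows how relaxed rules can apply while a change is in flight:

```python
# Master data held in one place, copies held in other systems. While a change
# is being disseminated, the record is flagged as "in transition" so that
# relaxed consistency rules can be applied to it.

master = {"customer-42": {"address": "1 Old St", "version": 1}}
replicas = {
    "billing":  {"customer-42": {"address": "1 Old St", "version": 1}},
    "shipping": {"customer-42": {"address": "1 Old St", "version": 1}},
}
in_transition = set()

def update_master(key, field, value):
    """A local change: applied to the master and flagged as in transition."""
    master[key][field] = value
    master[key]["version"] += 1
    in_transition.add(key)

def disseminate(key):
    """The propagation process: push the change to every copy, then lift the flag."""
    for replica in replicas.values():
        replica[key] = dict(master[key])
    in_transition.discard(key)

def strictly_consistent(key):
    # Data in transition is exempt from the strict "all copies agree" rule.
    return key not in in_transition

update_master("customer-42", "address", "2 New St")
print(strictly_consistent("customer-42"))   # False - change still in flight
disseminate("customer-42")
print(strictly_consistent("customer-42"))   # True - copies agree again
```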

I think these concepts can be more generally applicable to SOA design-time governance – for example, in the collaborative design of enterprise schemas or service contracts.

Talk to the Hand

A common issue with any kind of BPM implementation or distributed application is the problem of what to do with errors. The most important thing is to direct errors to the correct system or person – “correct” meaning one that can actually do something about the error.

Take our internal timesheet application as an example. I’m sure your system is similar, where an administrator sets up a project along with various business rules such as who can enter time against a project, or what is the valid date range for the project. When I enter my time records, I invariably run afoul of one of these rules – who could have guessed the project would run late?

The problem arises when I get a (usually very obscure) error message about one of these constraints. I can’t actually do anything about incorrect dates or ad hoc resource assignments because I don’t own the project. So it’s no good erroring out to me, the user. The really brain-dead thing is that I have to discard the timesheet entries, send an email to the project owner (whoever that may be) and re-enter the data later…if at all. It certainly doesn’t help collect accurate and timely information.

A better approach would be to accept the entered data and refer the errors to the project owner who is in a position to do something about it. This person can review the errors and correct the project rules accordingly. Alternatively they can reject the entry because I may have actually broken some valid rules.

This process flow is much more user-friendly and streamlined. It requires a couple of important facilities:

First, you need to suspend business rules for part of the process. My timesheet entries initially break the business rules, but we don’t know whether that is because I have erred or because the rules or reference data are in error. We need to accept the entries in an invalid state until the right authority can decide. This is an example of data validity rules that are not absolute, but depend on the process.
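
As a sketch of what that suspension might look like (the rules and field names here are invented, not taken from any real timesheet system): the validator reports violations rather than rejecting the entry, and the entry is accepted in a provisional state.

```python
from datetime import date

def validate_entry(entry, project):
    """Report rule violations rather than rejecting the entry outright."""
    violations = []
    if entry["date"] > project["end_date"]:
        violations.append("date is outside the project's date range")
    if entry["user"] not in project["assigned_users"]:
        violations.append("user is not assigned to this project")
    return violations

def submit_entry(entry, project):
    violations = validate_entry(entry, project)
    if violations:
        # Accept the data anyway, but in a provisional state, so the right
        # authority can decide whether the entry or the rules are at fault.
        entry["status"] = "pending_review"
        entry["violations"] = violations
    else:
        entry["status"] = "accepted"
    return entry

project = {"end_date": date(2008, 6, 30), "assigned_users": {"alice"}}
entry = {"user": "bob", "date": date(2008, 7, 2), "hours": 8}
print(submit_entry(entry, project))   # captured, but pending review
```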

Secondly, you need an error service – a facility whereby errors, when detected, can be routed to the appropriate person or system for resolution.
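
And a correspondingly small sketch of the routing side – errors go to whoever owns the thing that caused them, rather than back to whoever happened to trigger them. The owner registry and work queues are placeholders for whatever directory and task mechanism you actually have:

```python
# A minimal error service: route a detected error to the party that can
# actually fix it, rather than back to the submitting user.

project_owners = {"PROJ-17": "carol"}     # placeholder owner registry
work_queues = {}                          # one task list per resolver

def route_error(error):
    owner = project_owners.get(error["project"], "system-administrator")
    work_queues.setdefault(owner, []).append(error)
    return owner

error = {
    "project": "PROJ-17",
    "entry": {"user": "bob", "date": "2008-07-02", "hours": 8},
    "violations": ["date is outside the project's date range"],
}
resolver = route_error(error)
print(f"routed to {resolver}: {work_queues[resolver]}")
```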

Our timesheet system is based on the old database paradigm – a monolithic, data-centric application with only one place for rules to be applied. The “workflow-based” solution I describe is slightly more complicated than the monolithic application, but the payoff is a better user experience and more accurate and timely records. And if you’re a services organization with thousands of billable timesheet records each day, then you know what takes priority.