The Power of Later

Steve Vinoski nominates RPC as an historically bad idea, yet the synchronous request-reply message pattern is undeniably the most common pattern out there in our client-server world. Steve puts this down to “convenience”, but I think it goes deeper than that, because RPC is actually not convenient – it causes an awful lot of problems.

In addition to the usual problems cited by Steve, RPC makes heavy demands on scalability. Consider the SLA requirements for an RPC service provider. An RPC request must respond in a reasonable time interval – typically a few tens of seconds at most. But what happens when the service is under heavy load? What strategies can we use to ensure a reasonable level of service availability?

  1. Keep adding capacity to the service so as to maintain the required responsiveness…damn the torpedoes and the budget!
  2. The client times out, leaving the request in an indeterminate state. In the worst case, the service may continue working, only to return and find that the client has given up. Under continued assault, the availability of both the client and the server continues to degrade.
  3. The service stops accepting requests beyond a given threshold. Clients which have submitted a request are responded to within the SLA. Later clients are out of luck until the traffic drops back to manageable levels. They will need to resubmit later (if it is still relevant).
  4. The client submits the request to a proxy (such as a message queue) and then carries on with other work. The service responds when it can and hopefully the response is still relevant at that point.

Out of all these coping strategies, it seems that Option 2 is the most common, even when in many cases one of the other strategies is more efficient or cost-effective. Option 1 might be the preferred option given unlimited funds (and ignoring the fact it is often technically infeasible). In the real world Option 2 more often becomes the default.
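To make the difference between Options 3 and 4 concrete, here is a minimal sketch using Python's standard in-process queue as a stand-in for a real service inbox. The function names, the capacity of 2 and the "order-n" requests are all made up for illustration. Option 3 refuses work at once when the inbox is full; Option 4 accepts everything and lets the service catch up later.

```python
import queue

def submit_or_reject(inbox, request):
    """Option 3: accept work only while there is room, otherwise refuse at once."""
    try:
        inbox.put_nowait(request)
        return True
    except queue.Full:
        return False  # the client knows immediately and can resubmit later

def submit_and_move_on(inbox, request):
    """Option 4: hand the request to a queue and carry on with other work."""
    inbox.put(request)

def drain(inbox):
    """The service works through whatever is waiting, in arrival order."""
    handled = []
    while True:
        try:
            handled.append("handled " + inbox.get_nowait())
        except queue.Empty:
            return handled

# Hypothetical threshold: the service admits at most 2 outstanding requests.
bounded = queue.Queue(maxsize=2)
accepted = [submit_or_reject(bounded, f"order-{n}") for n in range(4)]

# Option 4 uses an unbounded queue, so nothing is refused.
unbounded = queue.Queue()
for n in range(4):
    submit_and_move_on(unbounded, f"order-{n}")

option3_results = drain(bounded)
option4_results = drain(unbounded)
```

In neither case is a request left in an indeterminate state: it is either refused up front or it is sitting in the queue. That is exactly what Option 2 cannot promise.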

The best choice depends on what the service consumer represents and what is the cost of any of the side-effects when the service fails to meet its SLA.

When the client is a human – say ordering something at our web site:

  • Option 2 means that we get a pissed off user. That may represent a high, medium or low cost to the organization depending on the value of that user. In addition there is the cost of indeterminate requests. What if a request was executed after the client timed out? There may be a cost of cleaning up or reversing those requests.
  • Option 3 means that we also get a pissed off user – with the associated costs. We may lose a lot of potential customers who visit us during the “outage”. On the positive side, we minimise the risk/cost of indeterminate outcomes.
  • Option 4 is often acceptable to users – they know we have received their request and are happy to wait for a notification in the future. But there are some situations where immediate gratification is paramount.

On the other hand, if the client is a system participating in a long-running BPM flow, then we have a different cost/benefit equation.

  • For Option 2, we don’t have a “pissed off” user. But the transaction times out into an “error bucket” and is left in an indeterminate state. We must spend time and effort (usually costly human effort) to determine where that request got to and remediate that particular process. This can be very costly.
  • Option 3 once again has no user impact, and we minimise the risk of indeterminate requests. But what happens to the halted processes? Either they error out and must be restarted, which is expensive, or they must be queued up in some way, in which case Option 3 becomes equivalent to Option 4.
  • In the BPM scenario, Option 4 represents the smoothest path. Requests are queued up and acted upon when the service can get to them. All we need is patience and the process will eventually complete without the need for unusual process rollbacks or error handling. If the queue is persistent then we can even handle a complete outage and restoration of the service.

So if I am a service designer planning to handle service capacity constraints, for human clients I would probably choose (in order) Option 3, then Option 4, and only then consider the costs of Option 2. For BPM processes where clients are “machines” I would prefer Option 4 every time. Why make work for myself handling timeouts?

One problem I see so often is that solution designers go for Option 2 by default – the worst of all the options available to them.

Ian Robinson on Coupling

In my opinion, coupling is the most fundamental attribute of a system architecture and tight coupling is probably the most common architectural problem I see in distributed systems. The manner in which system components interact can be a chief determinant of the scalability and reliability of the final system.

So I really like Ian Robinson’s post on Temporal and Behavioural Coupling where he uses two coupling dimensions and the inevitable magic quadrant to classify systems based on their degree of temporal and behavioural coupling.

See Ian’s post for the slick professional graphics, but to summarise – event-oriented systems with low coupling occupy the “virtuous” third quadrant of the matrix. Conversely, the brittle “3-tier” applications that many of us struggle with occupy the “evil” first quadrant, where coupling in both dimensions is high.

However, I’m a little miffed to see no mention of my favourite “document-oriented message” in Ian’s diagram. As Bill Poole writes, document messages have lower behavioural coupling than command messages, but more than event messages. So would you put document-oriented messages near the middle top of the matrix between command-oriented and event-oriented messages? Unfortunately that would break the symmetry. But it also highlights another problem.

Any type of message – document, command or event-oriented – could be temporally tightly or loosely coupled. Temporal coupling is more a property of the message transport than of the message type. So I suggest that the two coupling dimensions are characterised as follows:

  • Temporal coupling – characterised by message transport from RPC (tight coupling) through to MOM (loose coupling).
  • Behavioural coupling – characterised by the message type from command-oriented (tight) through document-oriented to event-oriented (loose).

It so happens that distributed 3-tier systems generally employ both command-oriented messages and RPC transports – hence making them inherently “evil”. Events, on the other hand, are asynchronous by nature and are typically carried over MOM transports (it’s difficult to request an event notification), which makes them naturally virtuous.
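The two dimensions can be written down as a small lookup, which makes it easy to see where a given combination of transport and message type lands. This is only my own sketch: the enum names and the "medium" label for document messages are assumptions for illustration, not anything taken from Ian's or Bill's posts.

```python
from enum import Enum

class Transport(Enum):
    RPC = "rpc"   # caller waits for the reply
    MOM = "mom"   # message-oriented middleware, caller does not wait

class MessageType(Enum):
    COMMAND = "command"
    DOCUMENT = "document"
    EVENT = "event"

# Temporal coupling follows the transport.
TEMPORAL = {Transport.RPC: "tight", Transport.MOM: "loose"}

# Behavioural coupling follows the message type; "medium" is my own label.
BEHAVIOURAL = {
    MessageType.COMMAND: "tight",
    MessageType.DOCUMENT: "medium",
    MessageType.EVENT: "loose",
}

def coupling(transport, message_type):
    """Return (temporal, behavioural) coupling for a given interaction."""
    return TEMPORAL[transport], BEHAVIOURAL[message_type]

three_tier = coupling(Transport.RPC, MessageType.COMMAND)
event_driven = coupling(Transport.MOM, MessageType.EVENT)
document_over_rpc = coupling(Transport.RPC, MessageType.DOCUMENT)
```

The point of keeping the two lookups separate is that the dimensions are independent: a document message sent over RPC is still temporally tight, whatever its behavioural coupling.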

Between heaven and hell, it is in the murky mortal realms of SOA where we need to be constantly mindful of the interactions between message type and transport – lest our system ends up in limbo.

The Architectural Role of Messaging

JMS has brought messaging more into the mainstream which is a good thing. But just like any new technology there is the danger that the first implementations will reflect older paradigms. I remember when I made the move from FORTRAN to C, for a while I wrote a lot of FORTRAN programs in the C language syntax until I got more familiar with C features and idioms. The same goes for my more recent ventures into Ruby with many years of Java thinking under my belt.

Coming back to JMS, when I review distributed applications that have been designed by people with a strong client-server or web background, I see a lot of RPC message semantics. While RPC (or synchronous request/reply) has its place, it is not always the best approach. A common mistake is to regard messaging as simply a way to get messages from point A to point B…treating JMS as a simple transport such as a TCP socket or HTTP.

Messaging originated as the concept of a distributed queue. Most programmers are familiar with queues from GUI frameworks where communications between widgets are mediated via an event queue. The event queue supports a number of functions such as decoupling widgets from each other…allowing each widget to do what it needs to in its own time, and supporting event-driven interactions between widgets. The event queue, along with multithreading, is key to giving user interfaces the responsiveness and robustness that you expect. In this way queueing provides more than just a communications mechanism: it is key to the architecture of a GUI framework.

The same is true of distributed messaging systems. In their original conception distributed queues do more than just provide a way for data to pass from one system to another, they provide an important element of isolation.

A fundamental difficulty in building distributed systems is that the different components have different performance characteristics. In addition, the uptime of your total system is the product of the uptime of individual components. To ensure maximum uptime you want your components to be independent of each other and, in the event of failure, you want to be able to restart from where you left off. This is where message queues work really well. Component A puts a message onto a queue and doesn’t care if or when component B takes that message off the queue. This is known as the fire and forget message pattern and it provides the best isolation between your system components.
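The uptime arithmetic is worth a quick worked example. The sketch below assumes that every component must be up for the system to be up and that failures are independent; the 99% figure is just an illustrative number.

```python
from math import prod

def system_uptime(component_uptimes):
    """Uptime of a chain in which every component must be available."""
    return prod(component_uptimes)

# Three tightly coupled components, each available 99% of the time
# (illustrative figures, assuming independent failures).
chained = system_uptime([0.99, 0.99, 0.99])
```

Three components at 99% each give a system that is up only about 97% of the time, and every extra component in the chain drags the figure down further.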

If instead we make Component A wait for an acknowledgment from Component B before it proceeds then we are building a tight coupling into the system. Any performance difficulties or failure experienced by Component B could spread back to Component A and thence to other components up the chain.

So the role of messaging in distributed systems goes beyond just getting a message from point A to point B. It also acts as a kind of expansion joint for your system allowing individual components to vary in their performance characteristics – or even fail totally for short periods – without breaking adjacent components.
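Here is a minimal sketch of the expansion joint at work, using an in-memory deque as a stand-in for a real (ideally persistent) message queue. The ComponentB class, its `up` flag and the message names are all hypothetical. Component A keeps sending while Component B is down, and B picks up where it left off when it comes back.

```python
from collections import deque

class ComponentB:
    """Hypothetical consumer that can be switched off to simulate an outage."""

    def __init__(self, inbox):
        self.inbox = inbox
        self.up = True
        self.processed = []

    def poll(self):
        # Take work off the queue only while the component is running.
        while self.up and self.inbox:
            self.processed.append(self.inbox.popleft())

inbox = deque()  # stand-in for the message queue between A and B
b = ComponentB(inbox)

def component_a_send(message):
    # Fire and forget: A enqueues and returns without waiting for B.
    inbox.append(message)

component_a_send("m1")
b.poll()

b.up = False                 # B fails
component_a_send("m2")       # A is unaffected and keeps sending
component_a_send("m3")
b.poll()                     # B is down, so nothing is consumed
backlog_during_outage = len(inbox)

b.up = True                  # B is restored and drains the backlog in order
b.poll()
```

Nothing in Component A had to know about the outage. The only visible effect is a backlog that builds up and then drains.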

Without these messaging expansion joints, your system is tightly coupled and prone to system wide failure originating from a single component. Messaging – using the fire and forget pattern – allows these issues to be locally absorbed and managed within normal system operations.

Push versus Pull

From OSCON via O’Reilly Radar, here’s a good case study of an architectural decision driven by the system requirements rather than the usual religious considerations that pollute the blogosphere.

FriendFeed needed update info from Flickr but a REST-based “pull” approach is highly inefficient in this case. Instead the solution architects opted for a “push” approach using XMPP as the message transport. This is a really good presentation because it goes into the architectural choices and implications of “push” versus “pull”.

I characterize this as “pull vs push” rather than “REST vs XMPP” (or “REST vs *” or “why REST is crap”) because fundamentally it comes down to the best choice of how to synchronize changes between systems. You make this choice based on the usage characteristics of the different systems, the likely traffic volumes this will result in and the consequential resource impacts. Having made the choice between push and pull, you then choose the appropriate message transport.
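A back-of-the-envelope calculation shows why traffic volume drives the choice. All the numbers below are made up for illustration (they are not FriendFeed's or Flickr's figures): one subscriber watching 10,000 resources, polling each every 15 minutes, where each resource actually changes about once every two days.

```python
def pull_requests_per_day(resources, polls_per_resource_per_day):
    """Pull: the subscriber asks about every resource on every polling cycle."""
    return resources * polls_per_resource_per_day

def push_messages_per_day(resources, changes_per_resource_per_day):
    """Push: the publisher sends a message only when something changes."""
    return resources * changes_per_resource_per_day

# Hypothetical workload, chosen only to show the shape of the trade-off.
pull = pull_requests_per_day(10_000, 96)    # 96 polls/day = one every 15 minutes
push = push_messages_per_day(10_000, 0.5)   # one change every two days
wasted_polls = pull - push                  # polls that would find nothing new
```

With these assumed figures the pull approach makes 960,000 requests a day to pick up roughly 5,000 changes. Reverse the ratio, with many occasionally connected subscribers and few resources, and polling starts to look like the cheaper option.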

The web doesn’t do a lot of “push” and consequently there is not a lot of discussion about push and REST. Dare Obasanjo characterises it nicely:

Polling is a good idea for RSS/Atom for a few reasons

  • there are a thousands to hundreds of thousands clients that might be interested in a resource so the server keeping track of subscriptions is prohibitively expensive
  • a lot of these end points aren’t persistently connected (i.e. your desktop RSS reader isn’t always running)
  • RSS/Atom publishing is as simple as plopping a file in the right directory and letting IIS or Apache work its magic

The situation between FriendFeed and Flickr is almost the exact opposite. Instead of thousands of clients interested in document, we have one subscriber interested in thousands of documents. Both end points are always on or are at least expected to be. The cost of developing a publish-subscribe model is one that both sides can afford.

Inside the firewall, the situation is often more akin to that between FriendFeed and Flickr. This is why messaging is more common inside the firewall than outside – not because either REST or messaging is universally superior, but because the system requirements are different and often favour a push approach rather than pull.

While you’re over at Dare’s excellent blog, be sure to also check out his discussion of push versus pull in the context of scaling Twitter and MS Exchange. These are important considerations for designers of federated systems such as federated databases or federated messaging systems. The example of FriendFeed to Flickr could be considered as the first incremental step toward a federation.

Waiting for the great leap forward

One of the original and fundamental tenets of the SOAP standard was that the SOAP message is independent of the underlying transport. Ostensibly you could use SOAP over HTTP, JMS, email, FTP etc. but the reality is that a standard binding has only ever existed for SOAP over HTTP. To paraphrase Henry Ford – “you can have any SOAP transport you like, as long as it’s HTTP”.

While HTTP is undoubtedly a good choice for SOAP – given its ubiquity – there is at least one other transport which demands attention. This is the JMS transport, which is widely used inside the firewall of many organizations. At all of the companies I work with, the SOA infrastructure relies heavily on JMS transports inside the firewall, with HTTP transports outside the firewall or to selected service end-points such as web pages. Of course my experience has significant selection effects, but nevertheless JMS is an important transport in many SOAs. Testament to this is that every major web-services product vendor (save Microsoft) supports SOAP over JMS (and even Microsoft now has SOAP over MSMQ as an important part of WCF).

The fly in the ointment is that there has never been a standardized binding for SOAP over JMS and as a result there is little interoperability between SOAP/JMS solutions provided by different platform vendors. If you happen to have any combination of different web-service platforms in your organization, then they cannot easily communicate with each other using SOAP over JMS without performing some unnatural acts.

Some of the issues that need to be considered with a SOAP binding to JMS are:

  • How do you represent the message content – text or binary? Most vendors have chosen a text message representation, but that has problems with multi-byte encodings, so other vendors have gone with a byte message representation.
  • What headers do you define and what should their names be? How do you use the standard JMS headers? Different vendors have different naming conventions and semantics.
  • In the WSDL description, how do you represent the connection details to the JMS provider?
  • How do your service endpoints manage the different message exchange patterns that are available with message-oriented transports?

Each of the vendors went their own way on many of these issues and as far as interoperability was concerned they basically ceded the field to HTTP. They made life difficult for large organizations with heterogeneous platforms and in my opinion didn’t do themselves any favours on the way. (Actually SOAP-encoding interoperability was so broken for a while that no one noticed the JMS issues…so maybe it wasn’t so bad).

Subsequently it was great to see some of the vendors get together a couple of years ago to agree on a standard SOAP binding for JMS that addresses most of the important considerations. The result was a Member Submission to W3C in September last year. My understanding is that this submission was previously circulated through most of the vendor community so hopefully it has general agreement on the technical details.

This has now taken its first steps to standardization with the initiation of a SOAP-JMS Binding Working Group, which aims to publish a recommendation by April next year. Hopefully vendor support of the binding will be hot on its trail.

Note that the standard binding won’t address the fact that different JMS implementations do not interoperate. For example, a TIBCO JMS client will not be able to talk to a Websphere JMS provider because JMS is an API standard, not a wire-protocol standard. What the SOAP/JMS binding standard does mean is that once you have settled on a standard JMS provider for your services, you could define your service description in standard WSDL and your service provider (say Websphere or TIBCO or WSO2) and your service consumer (say TIBCO or WebLogic or Axis) would be able to communicate directly using SOAP over JMS “out of the box”.

It’s been eight years (almost to the day) since SOAP 1.1 came out with the HTTP binding. Wouldn’t it be great if a standard JMS binding could be achieved within the decade! It’s been a very long wait. The JMS binding should have happened a lot sooner and I can’t say the “wait has been worth it” but it does fill an important hole in the Web-Services standards.

So what do we do in the meantime? You can eschew JMS altogether and stick with HTTP, but that requires another lot of hard work. You can stick with one and only one service platform, but that is difficult in large heterogeneous organizations – which is where SOA is supposed to provide maximum benefit. Or you can continue to do what many SOA implementers have done and deal with SOAP directly at the JMS layer – effectively using SOAP as plain-old-XML over JMS. I wrote more about this approach recently.
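As a rough illustration of the “plain-old-XML” approach, the sketch below builds a SOAP 1.1 envelope by hand, ships it as raw bytes, and parses it again at the other end. An in-process Python queue stands in for the JMS destination, and the `placeOrder` element and its content are made up. A real implementation would use your JMS provider's client API and would still have to settle the header and WSDL questions listed above.

```python
import queue
import xml.etree.ElementTree as ET

# SOAP 1.1 envelope namespace.
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"

def build_envelope(payload_tag, payload_text):
    """Wrap a single payload element in a SOAP Envelope/Body, as UTF-8 bytes."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    ET.SubElement(body, payload_tag).text = payload_text
    # Serialising to bytes keeps the encoding explicit, which sidesteps the
    # multi-byte problems of a text message representation.
    return ET.tostring(envelope, encoding="utf-8")

def read_payload(raw):
    """Parse the bytes and return (tag, text) of the first Body child."""
    envelope = ET.fromstring(raw)
    body = envelope.find(f"{{{SOAP_NS}}}Body")
    first = body[0]
    return first.tag, first.text

channel = queue.Queue()  # stand-in for a JMS queue (assumption for this sketch)
channel.put(build_envelope("placeOrder", "café ×2"))  # hypothetical payload

tag, text = read_payload(channel.get())
```

Both ends agree only on the envelope structure and the byte encoding, which is about as much as you can rely on across vendors until the standard binding is widely supported.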

Another thing you can do in the meantime is ask your vendor when they will support the new SOAP/JMS binding.