Steve Vinoski nominates RPC as a historically bad idea, yet the synchronous request-reply message pattern is undeniably the most common pattern in our client-server world. Steve puts this down to “convenience”, but I think it goes deeper than that, because RPC is not actually convenient – it causes an awful lot of problems.
In addition to the usual problems cited by Steve, RPC makes heavy demands on scalability. Consider the SLA requirements for an RPC service provider. An RPC request must be answered within a reasonable interval – typically a few tens of seconds at most. But what happens when the service is under heavy load? What strategies can we use to ensure a reasonable level of service availability?
- Option 1: Keep adding capacity to the service so as to maintain the required responsiveness… damn the torpedoes and the budget!
- Option 2: The client times out, leaving the request in an indeterminate state. In the worst case, the service keeps working on the request only to find that the client has given up. Under continued assault, the availability of both the client and the server continues to degrade.
- Option 3: The service stops accepting requests beyond a given threshold. Clients that have already submitted a request are answered within the SLA; later clients are out of luck until the traffic drops back to manageable levels, and will need to resubmit later (if the request is still relevant).
- Option 4: The client submits the request to a proxy (such as a message queue) and then carries on with other work. The service responds when it can, and hopefully the response is still relevant at that point. (A sketch contrasting Options 2 and 4 follows this list.)
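To make the contrast concrete, here is a minimal sketch of Options 2 and 4 from the client's point of view. The order service URL, payload and timeout are hypothetical; it uses Python's requests library, with an in-process queue standing in for a real message broker.

```python
import queue
import requests

order = {"customer": 42, "sku": "ABC-123"}

# Option 2: synchronous request-reply with a client-side timeout.
# If the service is slow, the client gives up, but the request is left
# in an indeterminate state -- the service may still complete it later.
try:
    response = requests.post("http://orders.example/submit", json=order, timeout=30)
    print("confirmed:", response.json())
except requests.exceptions.Timeout:
    print("timed out -- did the order go through? We don't know.")

# Option 4: hand the request to a proxy (here an in-process queue standing
# in for a durable message broker) and carry on with other work.
# The service consumes from the queue at its own pace and replies when it can.
outbound = queue.Queue()
outbound.put(order)  # fire-and-continue; no blocking on the service
print("order queued; we'll be notified once it has been processed")
```

The point of the Option 4 branch is that the client never blocks on the service's responsiveness; the queue absorbs the load peak instead.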
Of all these coping strategies, Option 2 seems to be the most common, even though in many cases one of the other strategies would be more efficient or cost-effective. Option 1 might be the preferred option given unlimited funds (and ignoring the fact that it is often technically infeasible), but in the real world Option 2 becomes the default.
The best choice depends on what the service consumer represents and on the cost of the side-effects when the service fails to meet its SLA.
When the client is a human – say, ordering something on our web site:
- Option 2 means that we get a pissed-off user. That may represent a high, medium or low cost to the organisation, depending on the value of that user. In addition there is the cost of indeterminate requests: what if a request was executed after the client timed out? There may be a cost in cleaning up or reversing those requests.
- Option 3 means that we also get a pissed-off user – with the associated costs. We may lose a lot of potential customers who visit us during the “outage”. On the positive side, we minimise the risk and cost of indeterminate outcomes. (A sketch of this kind of load shedding follows this list.)
- Option 4 is often acceptable to users – they know we have received their request and are happy to wait for a notification later. But there are some situations where immediate gratification is paramount.
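As a rough illustration of the load shedding behind Option 3, here is a minimal sketch assuming a hypothetical Flask endpoint; the route, threshold and error message are illustrative, and a real deployment would more likely shed load at a gateway or load balancer.

```python
import threading
from flask import Flask, jsonify

app = Flask(__name__)

MAX_IN_FLIGHT = 100                       # illustrative threshold
in_flight = threading.Semaphore(MAX_IN_FLIGHT)

@app.route("/orders", methods=["POST"])
def place_order():
    if not in_flight.acquire(blocking=False):
        # Over the threshold: fail fast with "come back later" rather than
        # letting the client time out into an indeterminate state.
        return jsonify(error="service busy, please retry later"), 503
    try:
        # ... real order processing would go here ...
        return jsonify(status="order accepted")
    finally:
        in_flight.release()
```

Failing fast like this is what keeps the already-admitted requests inside their SLA.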
On the other hand, if the client is a system participating in a long-running BPM flow, then we have a different cost/benefit equation.
- For Option 2, we don’t have a “pissed-off” user, but the transaction times out into an “error bucket” and is left in an indeterminate state. We must then spend time and effort (usually costly human effort) to determine where that request got to and to remediate that particular process. This can be very expensive.
- Option 3 once again has no user impact, and we minimise the risk of indeterminate requests. But what happens to the halted processes? Either they error out and must be restarted, which is expensive, or they must be queued up in some way – in which case Option 3 becomes equivalent to Option 4.
- In the BPM scenario, Option 4 represents the smoothest path. Requests are queued up and acted upon when the service can get to them. All we need is patience, and the process will eventually complete without unusual process rollbacks or error handling. If the queue is persistent then we can even ride out a complete outage and restoration of the service. (A sketch of such a consumer follows this list.)
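By way of illustration, here is a minimal sketch of an Option 4 consumer, assuming RabbitMQ with the pika client; the queue name and handler are hypothetical, and any broker with durable queues and acknowledgements would serve just as well.

```python
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="process.orders", durable=True)  # survives broker restarts
channel.basic_qos(prefetch_count=1)                          # take work only as fast as we can do it

def handle(ch, method, properties, body):
    order = json.loads(body)
    # ... invoke the order service here; if we crash before acking,
    # the message is redelivered and the process still completes ...
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="process.orders", on_message_callback=handle)
channel.start_consuming()  # patience: the backlog drains when capacity allows
```

Because the queue is durable and messages are only acknowledged after the work is done, a service outage simply lengthens the backlog rather than killing processes.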
So if I am a service designer planning to handle service capacity constraints, for human clients I would probably choose Option 3, then Option 4, and weigh carefully the costs of Option 2. For BPM processes, where the clients are “machines”, I would prefer Option 4 every time. Why make work for myself handling timeouts?
One problem I see so often is that solution designers go for Option 2 by default – the worst of all the options available to them.
4 comments
Self-comment: apropos of this topic is slide 18 of Evan Weaver’s QCon presentation “Improving Running Components at Twitter”:
“Message Queue purpose in a webapp: Move operations out of the synchronous request cycle. Amortize load over time”
Hi Saul,
Any comments on a few other dimensions to this, namely transactions, retries and compensation?
End-to-end transactions are very rare, although you do see some point-to-point ones. This effectively makes retries completely redundant (although everyone still seems to put them in for some reason).
Compensation seems to be the dominant approach in the organisations I’ve seen, although it’s often poorly implemented – and often manual. Done well, it can be quite effective, however.
Compensation can be better where a synchronous user-time response is absolutely mandatory, i.e. you can’t respond to the user asynchronously as per Option 4.
However, it’s not necessarily the best for BPM. A compensating service effectively adds another scenario, which increases the complexity of the process. Generally speaking, the process is better at determining what should/shouldn’t be rolled back for a particular case…
Jon
Hi Jon and thanks for the question:
In the bigger picture, this post asks the question “when do I know my business process is dead”? The points you raise are around “what do I do when my process is dead”? The key issue I wanted to raise is that many people prematurely kill their process using timeouts, when it may not be dead in the first place.
Once you kill a process then, yes, you need to clean up. Distributed transactions (a la two-phase commit) are one textbook option, but I have never seen them work in practice for anything resembling a “business process”. Compensation is the more common approach.
This is good fodder for another post I think.
[…] transports tie us to a regimen of synchronous request-reply with timeouts which creates very tight couplings between provider and consumer. Even though one-way MEPs were an […]