Entries from May 2009 ↓

The Real Economics of SOA

The title of the InfoQ article, “Economics of Service Orientation” caught my interest, but unfortunately I was left disappointed. I felt that the article missed the mark somewhat and also missed some of the important costs of Service Orientation in its eagerness to concentrate on the benefits.

In my view, the real economics of SOA is about how many times you pay for the same function.

In the world of packaged applications you have very coarse control over the number and bundling of the functions that you have available. If you buy a packaged application, then you typically buy a bundle of functions or capabilities – some of these hopefully will be functions that you need, yet others will be functions that you don’t need.

An example of this is if you buy a billing application which has inbuilt functionality for handling customer contact details. You may not need customer contacts in your billing application because it already exists in your CRM (for example). However, as delivered, the billing application requires its own proprietary customer contact details module because it is incapable of working with a 3rd party’s CRM data. So there are a bunch of costs associated with purchasing a packaged application:

  1. The cost of the functionality that you require.
  2. The cost of the functionality that you don’t require (the vendor charges licence and support fees for the whole package, not just the bits you really want to use).
  3. The infrastructure cost of hosting all this functionality – required and not required.
  4. The cost of integrating the functionality you require into your business processes.
  5. The cost of integrating and potentially mitigating the functionality that you don’t require.

If we consider cost 5) in the context of my example. Even though you have a perfectly good CRM, the billing system requires contact details to be managed within its own application database, so you have to integrate your CRM with the billing application. There may be side-effects of this integration which need to be mitigated. You may need to invent whole new business processes to manage duplication between two or more systems and the inevitable synchronization problems that arise.

In an SOA utopia, there is more fine grained control over the functions that you acquire from a vendor. In particular you can buy only the functions that you need and then have them plug into other services in your information infrastructure. This saves a number of costs:

  1. You only pay for the function that you require
  2. You don’t pay for the functions that you don’t require
  3. You only pay to host the functionality you have bought – not a bunch of tag-alongs. And if your function is remotely hosted, the costs of hosting the new functionality may be very low.
  4. You pay only the cost of integrating the required functionality into your business processes
  5. You don’t have to pay to integrate and mitigate functionality you don’t require

So the economic value of SOA is in the finer granularity with which you can manage the cost of your IT functionality. The job of the enterprise architect moves from managing a portfolio of applications – complete with duplicates and overlaps – to one of managing services or functions, with hopefully minimal duplicates and overlaps.

But this picture wouldn’t be complete without considering the additional overheads required for a well functioning SOA. Architectural standards and governance are required to ensure that you buy the right services at the right time and that the services can be integrated properly into your business processes. You should then add in costs associated with:

  • There is a cost for running an enterprise architecture function – managing standards and governance.
  • The cost of acquiring a specific function may be higher because of the additional planning and research required to get the “right” function for your business processes and to fit into your enterprise services framework.
  • The integration cost of exposing functions as services is generally going to be higher than just building a point integration. This is because of the greater degree of planning involved with providing a service in a reliable and re-usable manner.

But hopefully the additional cost of architectural governance and standards should be offset by the savings in managing services as opposed to packaged applications.

As a final note, it’s interesting to note that the economic model is similar for the vendors as it is for their customers. This is why many packaged application vendors are hopping on the bandwagon of SOA. These vendors have acquired applications with overlapping functionality and require a more fine-grained approach for managing and selling their functionality.

Dimensions of Coupling

Coupling is one of the most fundamental measures of “quality” for an information system. The concepts of coupling and cohesion appear in software design best practices for at least a couple of decades. And these concepts are also vital to the development of distributed systems. As core as the concept of coupling is, it is difficult to find a real definition in the distributed systems context. Coupling is like obscenity – we can’t define it, but we know it when we see it.

Which is why I was pleased to see Ian Robinson’s post which presented coupling as lying on two dimensions – temporal and behavioural and even put in place some characteristics which helps you put a rough measure on the degree of coupling. Coincidently, I had drafted my own version of this some time ago, but it had never made it to publication.

Like Ian, I was trying to quantify coupling so that we can understand what constitutes a tightly or a loosely coupled system and we can have some approach to measure it and therefore have a method to decide between design trade-offs in satisfying the various requirements of our distributed systems. While Ian presents a conceptually clean two-dimensional picture, I felt the true story involves multiple interacting dimensions.

While I was researching this, I happened to find a book extract which covers what I wanted to say and more. The full extract is well worth reading, but is summarised in the following table:

Level Tight Coupling Loose Coupling
Physical coupling Direct physical link required Physical intermediary
Communication style Synchronous Asynchronous
Type system Strong type system (e.g., interface semantics) Weak type system (e.g., payload semantics)
Interaction pattern OO-style navigation of complex object trees Data-centric, self-contained messages
Control of process logic Central control of process logic Distributed logic components
Service discovery and binding Statically bound services Dynamically bound services
Platform dependencies Strong OS and programming language dependencies OS- and programming language independent

Here we have no less than seven dimensions to the coupling equation.

The final paragraph of this article highlights the costs of loose-coupling (and only some of the benefits).

However, in most cases, the increased flexibility achieved through loose coupling comes at a price, due to the increased complexity of the system. Additional efforts for development and higher skills are required to apply the more sophisticated concepts of loosely coupled systems. Furthermore, costly products such as queuing systems are required. However, loose coupling will pay off in the long term if the coupled systems must be rearranged quite frequently.

I think this understates the benefit. “Rearranged frequently” seems to only cover design-changes. But it should also cover “runtime rearrangement” such as partitioning across redundant components for the purpose of load-balancing and fault-tolerance. In such cases, “loose-coupling” provides significant value in higher uptime and scalability of distributed systems.

Update May 14, 2009: Richard Veryard has pointed me to his paper “Component Based Service Engineering” (subscription required) which discusses an even wider range of coupling beyond the technical layers into Process, Organizational and Business layers. The CBDi Wiki has a table summarizing all the coupling dimensions identified in Richard’s paper.

One section of the paper struck a chord with me:

How can we have loose coupling and hard-wiring at the same time? The answer comes as soon as we recognize that coupling is multidimensional or multilayered. My head is connected (coupled) to the rest of my body in several different ways. Even if I could introduce some technology to decouple the nervous system, that doesn’t allow me to remove my head….With Web Services, SOAP simply removes one set of the hard-wired connections. Other forms of coupling remain.

This was written in 2003 and proves quite prescient in that many SOA projects in the interim have failed to achieve their goals by simply adopting out-of-the-box “web services” which only address one or two of the many dimensions of coupling.

The Power of Later

Steve Vinoski nominates RPC as an historically bad idea, yet the synchronous request reply message pattern is undeniably the most common pattern out there in our client-server world. Steve puts this down to “convenience” but I actually think it goes deeper than that. Because RPC is actually not convenient – it causes an awful lot of problems.

In addition to the usual problems cited by Steve, RPC makes heavy demands on scalability. Consider the SLA requirements for an RPC service provider. An RPC request must respond in a reasonable time interval – typically a few tens of seconds at most. But what happens when the service is under heavy load? What strategies can we use to ensure a reasonable level of service availability?

  1. Keep adding capacity to the service so as to maintain the required responsiveness…damn the torpedoes and the budget!
  2. The client times-out, leaving the request in an indeterminate state. In the worst case, the service may continue working only to return to find the client has given up. Under continued assault, the availability of both the client and the server continues to degrade.
  3. The service stops accepting requests beyond a given threshold. Clients which have submitted a request are responded to within the SLA. Later clients are out of luck until the traffic drops back to manageable levels. They will need to resubmit later (if it is still relevant).
  4. The client submits the request to a proxy (such as a message queue) and then carries on with other work. The service responds when it can and hopefully the response is still relevant at that point.

Out of all these coping strategies, it seems that Option 2 is the most common, even when in many cases one of the other strategies is more efficient or cost-effective. Option 1 might be the preferred option given unlimited funds (and ignoring the fact it is often technically infeasible). In the real world Option 2 more often becomes the default.

The best choice depends on what the service consumer represents and what is the cost of any of the side-effects when the service fails to meet its SLA.

When the client is a human – say ordering something at our web site:

  • Option 2 means that we get a pissed off user. That may represent a high, medium or low cost to the organization depending on the value of that user. In addition there is the cost of indeterminate requests. What if a request was executed after the client timed-out? There may be a cost of cleaning up or reversing those requests.
  • Option 3 means that we also get a pissed off user – with the associated costs. We may lose a lot of potential customers who visit us during the “outage”. On the positive side, we minimise the risk/cost of indeterminate outcomes.
  • Option 4 is often acceptable to users – they know we have received their request and are happy to wait for a notification in the future. But there are some situations where immediate gratification is paramount.

On the other hand, if the client is a system participating in a long-running BPM flow, then we have a different cost/benefit equation.

  • For Option 2, we don’t have a “pissed off” user. But the transaction times out into an “error bucket” and is left in an indeterminate state. We must spend time and effort (usually costly human effort) to determine where that request got to and remediate that particular process. This can be very costly.
  • Option 3  once again has no user impact, and we minimise the risk of indeterminate requests. But what happens to the halted processes? Either they error out and must be restarted – which is expensive. Alternatively they must be queued up in some way – in which case Option 3 becomes equivalent to Option 4.
  • In the BPM scenario, option 4 represents the smoothest path. Requests are queued up and acted upon when the service can get to it. All we need is patience and the process will eventually complete without the need for unusual process rollbacks or error handling. If the queue is persistent then we can even handle a complete outage and restoration of the service.

So if I am a service designer planning to handle service capacity constraints, for human clients I would probably choose (in order) Option 3, 4 and consider the costs of option 2. For BPM processes where clients are “machines” then I would prefer Option 4 every time. Why make work for myself handling timeouts?

One problem I see so often is that solution designers go for Option 2 by default – the worst of all the options available to them.