Entries Tagged 'distributed-computing' ↓

Any Lessons from the AWS Outage?

You’ve probably heard that Amazon AWS had some problems recently. A question on Stackoverflow recently pointed out a detailed summary of the problem posted on the AWS message board.

Obviously every distributed system is different and every outage is unique so it is difficult to generalise. Some takeways I have are:

  1. Outages happen to even the best guys on the block…so you better plan for yours.
  2. Building distributed systems is hard…so you need experience and experienced friends.
  3. Manual changes are a common cause…not said explicitly in the AWS writeup, but strongly implied.
  4. Outages are often “emergent” phenomena whereby a simple error causes many systems to interact in a way which grows exponentially. The AWS writeup refers to this as a “storm” and I have witnessed similar “storms” in large distributed systems. The degree of coupling and simple aspects like backoff parameters can make the difference between a disturbance that grows exponentially or decays exponentially. Think of the Tacoma Narrows bridge – perhaps the analogy is a stretch, but tuning of a few simple parameters can avoid destructive resonances.
  5. One of the responses pointed to the Netflix Chaos Monkey as being vindicated by the outage. The “Lean” guys have taught us that if something is difficult (like testing or deployment) then you should do it often until it aint difficult any more. Perhaps system failure/resilience is the next frontier for this approach.

Characterising Architectural Styles

I’ve recently been thinking about simple ways to characterise the different architectural approaches we use in distributed systems today. Three simple architectural characteristics I have come up with are:

  • Asset – this is a core capability or “thing” that must be built, procured, maintained and managed in a corporate IT “inventory”.
  • Element – these are the “atomic” building blocks used in the process of Composition.
  • Composition – this is the mechanism which allows the different assets to work together to support a business requirement.

If we apply these characteristics to three common architectural approaches then we get the results in the following table.


EAI, although  unfashionable is still prevalent – even dominant in the industry. For EAI, we are primarily concerned with applications (usually COTS) which embody and support the requirements of different parts of the business. Multiple applications must be coordinated to support the whole business. The primary coordination mechanism under EAI is synchronization of state between the different applications – primarily via data integration. The composition element is the API.

SOA is characterised by Services as the key asset. Services are acquired or built to execute business operations. Elemental Services are composed to support business processes – a sequence of operations which results in a business outcome. BPM (or process orchestration) is the composition mechanism within SOA. The underlying functionality of a Service may reside in one or more applications, but from an SOA perspective this is of secondary importance…SOA is concerned with the Service, not necessarily its implementation.

EDA (Event Driven Architecture) is characterised by Event Services as the key asset which represent an asynchronous notification of an important event associated with the business. Elemental Events are correlated and further processed to derive higher-order business intelligence which may in turn trigger other Events. The primary composition mechanism within EDA is Event Processing (or CEP). An important part of this composition mechanism is the ability to manage or track system state.

This characterization gives more clarity to the difference between JABOWS and true SOA. Many so-called SOA projects have simply involved the “bottom up” exposure of application APIs using web-services standards – resulting in “Just a Bunch of Web Services” which don’t realise the business value of a true SOA. JABOWS is EAI because the applications are the core “asset”. A true SOA has a “top down” process-centric architecture with Services as the core asset.

Dimensions of Coupling

Coupling is one of the most fundamental measures of “quality” for an information system. The concepts of coupling and cohesion appear in software design best practices for at least a couple of decades. And these concepts are also vital to the development of distributed systems. As core as the concept of coupling is, it is difficult to find a real definition in the distributed systems context. Coupling is like obscenity – we can’t define it, but we know it when we see it.

Which is why I was pleased to see Ian Robinson’s post which presented coupling as lying on two dimensions – temporal and behavioural and even put in place some characteristics which helps you put a rough measure on the degree of coupling. Coincidently, I had drafted my own version of this some time ago, but it had never made it to publication.

Like Ian, I was trying to quantify coupling so that we can understand what constitutes a tightly or a loosely coupled system and we can have some approach to measure it and therefore have a method to decide between design trade-offs in satisfying the various requirements of our distributed systems. While Ian presents a conceptually clean two-dimensional picture, I felt the true story involves multiple interacting dimensions.

While I was researching this, I happened to find a book extract which covers what I wanted to say and more. The full extract is well worth reading, but is summarised in the following table:

Level Tight Coupling Loose Coupling
Physical coupling Direct physical link required Physical intermediary
Communication style Synchronous Asynchronous
Type system Strong type system (e.g., interface semantics) Weak type system (e.g., payload semantics)
Interaction pattern OO-style navigation of complex object trees Data-centric, self-contained messages
Control of process logic Central control of process logic Distributed logic components
Service discovery and binding Statically bound services Dynamically bound services
Platform dependencies Strong OS and programming language dependencies OS- and programming language independent

Here we have no less than seven dimensions to the coupling equation.

The final paragraph of this article highlights the costs of loose-coupling (and only some of the benefits).

However, in most cases, the increased flexibility achieved through loose coupling comes at a price, due to the increased complexity of the system. Additional efforts for development and higher skills are required to apply the more sophisticated concepts of loosely coupled systems. Furthermore, costly products such as queuing systems are required. However, loose coupling will pay off in the long term if the coupled systems must be rearranged quite frequently.

I think this understates the benefit. “Rearranged frequently” seems to only cover design-changes. But it should also cover “runtime rearrangement” such as partitioning across redundant components for the purpose of load-balancing and fault-tolerance. In such cases, “loose-coupling” provides significant value in higher uptime and scalability of distributed systems.

Update May 14, 2009: Richard Veryard has pointed me to his paper “Component Based Service Engineering” (subscription required) which discusses an even wider range of coupling beyond the technical layers into Process, Organizational and Business layers. The CBDi Wiki has a table summarizing all the coupling dimensions identified in Richard’s paper.

One section of the paper struck a chord with me:

How can we have loose coupling and hard-wiring at the same time? The answer comes as soon as we recognize that coupling is multidimensional or multilayered. My head is connected (coupled) to the rest of my body in several different ways. Even if I could introduce some technology to decouple the nervous system, that doesn’t allow me to remove my head….With Web Services, SOAP simply removes one set of the hard-wired connections. Other forms of coupling remain.

This was written in 2003 and proves quite prescient in that many SOA projects in the interim have failed to achieve their goals by simply adopting out-of-the-box “web services” which only address one or two of the many dimensions of coupling.

The Architectural Role of Messaging

JMS has brought messaging more into the mainstream which is a good thing. But just like any new technology there is the danger that the first implementations will reflect older paradigms. I remember when I made the move from FORTRAN to C, for a while I wrote a lot of FORTRAN programs in the C language syntax until I got more familiar with C features and idioms. The same goes for my more recent ventures into Ruby with many years of Java thinking under my belt.

Coming back to JMS, I find when I review distributed applications that have been designed by people with a strong client-server or web background I see a lot of rpc message semantics. While rpc (or synchronous request/reply) has it’s place, this is not always the best approach. A common mistake is to regard messaging as simply a way to get messages from point A to point B…treating JMS as a simple transport such as a TCP socket or HTTP.

Messaging originated as the concept of a distributed queue. Most programmers are familiar with queues from GUI frameworks where communications between widgets are mediated via an event queue. The event queue supports a number of functions such as decoupling widgets from each other…allowing each widget to do what it needs to in it’s own time, and supporting event driven interactions between widgets. The event queue along with multi threading is key to giving user interfaces the responsiveness and robustness that you expect. In this way queueing provides more than just a communications mechanism but is key to the architecture of a GUI framework.

The same is true of distributed messaging systems. In their original conception distributed queues do more than just provide a way for data to pass from one system to another, they provide an important element of isolation.

A fundamental difficulty in building distributed systems is that the different components have different performance characteristics. In addition the uptime of your total system is the product of the uptime of individual components. To ensure maximum uptime you want your components to be independent of each other and, in the event of failure you want to be able to restart from where you left off. This is where message queues work really well. Component A puts a message onto a queue and doesn’t  care if or when component B takes that message off the queue. This is known as the fire and forget message pattern and it provides the best isolation between your system components.

If instead we make Component A wait for an acknowledgment from Component B before it proceeds then we are building a tight coupling into the system. Any performance difficulties or failure experienced by Component B could spread back to Component A and thence to other components up the chain.

So the role of messaging in distributed systems goes beyond just getting a message from point A to point B. It also acts as a kind of expansion joint for your system allowing individual components to vary in their performance characteristics – or even fail totally for short periods – without breaking adjacent components.

Without these messaging expansion joints, your system is tightly coupled and prone to system wide failure originating from a single component. Messaging – using the fire and forget pattern – allows these issues to be locally absorbed and managed within normal system operations.

Protocol Buffers cont’d

There’s a pretty heated interesting active discussion going on at Steve Vinoski’s blog re RPC. You know a discussion is going downhill when people start quoting scripture at each other 🙂  Here is my comment.