Beautiful Data Polling

O’Reilly have released a new book in their “Beautiful…” series called “Beautiful Data.” There’s a very comprehensive review on Slashdot, which I highly recommend. The description of chapter eight caught my eye:

Chapter Eight is about social data APIs and pushes gnip heavily as the de facto social endpoint aggregator for programmers. The chapter mentions WebHooks as an up and coming HTTP Post event transmission project but doesn’t offer much more than a wake up call for programmers. The traditional polling has dominated web APIs and has led to fragile points of failure. This chapter is a much needed call for sanity in the insane world of HTTP transactional polling. Unfortunately, the community seems to be so in love with the simplicity of polling that they use it for everything, even when a slightly more complicated eventing model would save them a large percentage of transactions.

The link “fragile points of failure” is worth following as it leads to a robust slashdot discussion on Twitter APIs and polling versus push for the web.

I think for a long time, the “web” as we know it has suffered from the lack of an Event/Listener paradigm. This is a pretty simple design concept that I’m going to refer to as the Observer [wikipedia.org]. Let’s say I want to know what Stephen Hawking is tweeting about, and I want to know 24/7. If I have to make more than one call, something is wrong. That one call should be a notification to Twitter of who I am, where to contact me and what I want to keep tabs on – be it a keyword or a user. So all I should ever have to do is tell Twitter I want everything from Stephen Hawking and everything with #stephenhawking, and from that point on it will try to deliver those messages to me via any number of technologies. Simple pub/sub [wikipedia.org] message queues could be implemented here to remove my need to continually go to Twitter and say: “Has Stephen Hawking said anything new yet? *millisecond pause* Has Stephen Hawking said anything new yet? *millisecond pause* …” ad infinitum.
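To make that concrete, here is a toy sketch of the subscribe-once, notify-many idea. It is not Twitter’s API – the hub, the topic names and the callbacks are all made up, and in-process callbacks stand in for whatever delivery technology (webhooks, XMPP, etc.) would be used in practice:

```python
# Toy illustration of the Observer / pub-sub idea described above.
# In-process callbacks stand in for webhook URLs; none of this is Twitter's API.
from collections import defaultdict

class Hub:
    def __init__(self):
        self._subscribers = defaultdict(list)   # topic -> list of callbacks

    def subscribe(self, topic, callback):
        # Register interest once; no polling after this point.
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # The publisher pushes each new item to every interested subscriber.
        for notify in self._subscribers[topic]:
            notify(message)

hub = Hub()
hub.subscribe("user:stephenhawking", lambda msg: print("new tweet:", msg))
hub.subscribe("#stephenhawking", lambda msg: print("tagged tweet:", msg))
hub.publish("user:stephenhawking", "A brief update about black holes")
```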

And yet…

That’s not easy to do on a large scale. A persistent connection has to be in place between publisher and subscriber. Twitter would have to have a huge number of low-traffic connections open. (Hopefully only one per subscriber, not one per publisher/subscriber combination.) Then, on the server side, they’d have to have a routing system to track who’s following what, invert that information, and blast out a message to all followers whenever there was an update. This is all quite feasible, but it’s quite different from the classic HTTP model.

It’s been done before, though. Remember Push technology [wikipedia.org]? That’s what this is. PointCast sent their final news/stock push message [cnet.com] in February 2000. There’s more support for “push” in HTML5, incidentally.

Ahhh yes, I remember PointCast well. One of the early darlings of the dot-com era. This reply points at some new hope:

For messaging architectures (like, say, the internet), the pattern is usually described as “Publish/Subscribe”. All serious messaging protocols support it (XMPP, AMQP, etc.) and some are dedicated to it (PubSubHubbub). The basic problem with using it the whole way to the client is that many clients run in environments where it is impractical to run a server, which makes receiving inbound connections difficult.

There are fairly good solutions to that, mostly involving using a proxy for the client somewhere that can run a server which holds messages, and then having the client call the proxy (rather than the message sources) to get all the pending messages together.
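A rough sketch of that proxy idea, with made-up names: publishers push messages into a mailbox held by the proxy, and the client makes a single outbound call to collect everything pending – no inbound connections to the client required:

```python
# Sketch of the "proxy that holds messages" idea from the comment above.
from collections import defaultdict, deque

class MailboxProxy:
    def __init__(self):
        self._mailboxes = defaultdict(deque)   # client_id -> pending messages

    def push(self, client_id, message):
        # Called by publishers/hubs; the client need not be reachable.
        self._mailboxes[client_id].append(message)

    def drain(self, client_id):
        # Called by the client over a single outbound connection; returns
        # and clears everything pending in one round trip.
        box = self._mailboxes[client_id]
        pending = list(box)
        box.clear()
        return pending

proxy = MailboxProxy()
proxy.push("alice", "tweet #1")
proxy.push("alice", "tweet #2")
print(proxy.drain("alice"))   # ['tweet #1', 'tweet #2'] -- one call, not many
```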

Keep watching.

Dimensions of Coupling

Coupling is one of the most fundamental measures of “quality” for an information system. The concepts of coupling and cohesion have featured in software design best practice for at least a couple of decades, and they are just as vital to the development of distributed systems. Yet as core as the concept of coupling is, it is difficult to find a real definition of it in the distributed systems context. Coupling is like obscenity – we can’t define it, but we know it when we see it.

Which is why I was pleased to see Ian Robinson’s post, which presents coupling as lying along two dimensions – temporal and behavioural – and even sets out some characteristics that help you put a rough measure on the degree of coupling. Coincidentally, I had drafted my own version of this some time ago, but it never made it to publication.

Like Ian, I was trying to quantify coupling: if we understand what constitutes a tightly or loosely coupled system, and have some way to measure it, then we have a method for deciding between design trade-offs in satisfying the various requirements of our distributed systems. While Ian presents a conceptually clean two-dimensional picture, I feel the true story involves multiple interacting dimensions.

While I was researching this, I happened to find a book extract which covers what I wanted to say and more. The full extract is well worth reading, but is summarised in the following table:

Level | Tight Coupling | Loose Coupling
Physical coupling | Direct physical link required | Physical intermediary
Communication style | Synchronous | Asynchronous
Type system | Strong type system (e.g., interface semantics) | Weak type system (e.g., payload semantics)
Interaction pattern | OO-style navigation of complex object trees | Data-centric, self-contained messages
Control of process logic | Central control of process logic | Distributed logic components
Service discovery and binding | Statically bound services | Dynamically bound services
Platform dependencies | Strong OS and programming language dependencies | OS- and programming-language independent

Here we have no fewer than seven dimensions to the coupling equation.
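To make one of those dimensions concrete, here is a small, purely illustrative contrast for the “interaction pattern” row – OO-style navigation of an object graph versus a single data-centric, self-contained message. All the class and field names here are invented:

```python
# Illustrative only: the "interaction pattern" row of the table above,
# with in-memory stand-ins for what would be remote objects.

class Address:
    def __init__(self, street):
        self.street = street

class Customer:
    def __init__(self, name, address):
        self.name = name
        self._address = address
    def get_shipping_address(self):
        return self._address

class Order:
    def __init__(self, customer):
        self._customer = customer
    def get_customer(self):
        return self._customer

# Tightly coupled interaction: the caller walks the provider's object graph,
# one fine-grained call at a time (imagine each call crossing the network).
order = Order(Customer("Alice", Address("1 High St")))
street = order.get_customer().get_shipping_address().street

# Loosely coupled interaction: a single self-contained, data-centric message
# carries everything the receiver needs; no call-backs into the sender.
message = {
    "order_id": 42,
    "customer": {"name": "Alice"},
    "shipping_address": {"street": "1 High St"},
}
print(street, message["shipping_address"]["street"])
```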

The final paragraph of this article highlights the costs of loose coupling (and only some of the benefits).

However, in most cases, the increased flexibility achieved through loose coupling comes at a price, due to the increased complexity of the system. Additional efforts for development and higher skills are required to apply the more sophisticated concepts of loosely coupled systems. Furthermore, costly products such as queuing systems are required. However, loose coupling will pay off in the long term if the coupled systems must be rearranged quite frequently.

I think this understates the benefit. “Rearranged frequently” seems to cover only design-time changes. It should also cover “runtime rearrangement”, such as partitioning work across redundant components for load balancing and fault tolerance. In such cases, loose coupling provides significant value through the higher uptime and scalability of distributed systems.

Update May 14, 2009: Richard Veryard has pointed me to his paper “Component Based Service Engineering” (subscription required) which discusses an even wider range of coupling beyond the technical layers into Process, Organizational and Business layers. The CBDi Wiki has a table summarizing all the coupling dimensions identified in Richard’s paper.

One section of the paper struck a chord with me:

How can we have loose coupling and hard-wiring at the same time? The answer comes as soon as we recognize that coupling is multidimensional or multilayered. My head is connected (coupled) to the rest of my body in several different ways. Even if I could introduce some technology to decouple the nervous system, that doesn’t allow me to remove my head….With Web Services, SOAP simply removes one set of the hard-wired connections. Other forms of coupling remain.

This was written in 2003 and has proved quite prescient: many SOA projects in the interim have failed to achieve their goals by simply adopting out-of-the-box “web services” that address only one or two of the many dimensions of coupling.

The Power of Later

Steve Vinoski nominates RPC as a historically bad idea, yet the synchronous request-reply pattern is undeniably the most common pattern out there in our client-server world. Steve puts this down to “convenience”, but I think it goes deeper than that, because RPC is not actually convenient – it causes an awful lot of problems.

In addition to the usual problems cited by Steve, RPC places heavy demands on scalability. Consider the SLA requirements for an RPC service provider. An RPC request must be answered within a reasonable interval – typically a few tens of seconds at most. But what happens when the service is under heavy load? What strategies can we use to ensure a reasonable level of service availability?

  1. Keep adding capacity to the service so as to maintain the required responsiveness…damn the torpedoes and the budget!
  2. The client times out, leaving the request in an indeterminate state. In the worst case, the service may complete the work only to find that the client has given up. Under continued assault, the availability of both the client and the server continues to degrade.
  3. The service stops accepting requests beyond a given threshold. Clients which have submitted a request are responded to within the SLA. Later clients are out of luck until the traffic drops back to manageable levels. They will need to resubmit later (if it is still relevant).
  4. The client submits the request to a proxy (such as a message queue) and then carries on with other work. The service responds when it can, and hopefully the response is still relevant at that point (see the sketch after this list).
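A minimal sketch of Option 4, using Python’s in-process queue as a stand-in for what would, in practice, be a durable message broker:

```python
# Option 4 sketch: clients enqueue requests and carry on; the service drains
# the queue at its own pace. queue.Queue stands in for a real message broker.
import queue
import threading
import time

requests = queue.Queue()

def service_worker():
    # The service processes requests at whatever rate it can sustain.
    while True:
        req = requests.get()
        if req is None:          # shutdown signal for this demo only
            break
        time.sleep(0.1)          # simulate real work
        print("processed", req)
        requests.task_done()

threading.Thread(target=service_worker, daemon=True).start()

# Clients enqueue their requests and immediately carry on with other work.
for i in range(5):
    requests.put({"order_id": i})
    print("submitted", i)

requests.join()                  # the demo waits here; real clients would not
requests.put(None)
```

The client’s part of the exchange finishes as soon as the request is enqueued; everything after that happens at the service’s own pace.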

Of all these coping strategies, Option 2 seems to be the most common, even though in many cases one of the others would be more efficient or cost-effective. Option 1 might be preferred given unlimited funds (and ignoring the fact that it is often technically infeasible), but in the real world Option 2 usually becomes the default.

The best choice depends on what the service consumer represents and what is the cost of any of the side-effects when the service fails to meet its SLA.

When the client is a human – say, ordering something at our web site:

  • Option 2 means that we get a pissed off user. That may represent a high, medium or low cost to the organization depending on the value of that user. In addition, there is the cost of indeterminate requests. What if a request was executed after the client timed out? There may be a cost of cleaning up or reversing those requests.
  • Option 3 means that we also get a pissed off user – with the associated costs. We may lose a lot of potential customers who visit us during the “outage”. On the positive side, we minimise the risk/cost of indeterminate outcomes.
  • Option 4 is often acceptable to users – they know we have received their request and are happy to wait for a notification in the future. But there are some situations where immediate gratification is paramount.

On the other hand, if the client is a system participating in a long-running BPM flow, then we have a different cost/benefit equation.

  • For Option 2, we don’t have a “pissed off” user. But the transaction times out into an “error bucket” and is left in an indeterminate state. We must spend time and effort (usually costly human effort) to determine where that request got to and remediate that particular process. This can be very costly.
  • Option 3 once again has no user impact, and we minimise the risk of indeterminate requests. But what happens to the halted processes? Either they error out and must be restarted – which is expensive – or they must be queued up in some way, in which case Option 3 becomes equivalent to Option 4.
  • In the BPM scenario, Option 4 represents the smoothest path. Requests are queued up and acted upon when the service can get to them. All we need is patience, and the process will eventually complete without the need for unusual process rollbacks or error handling. If the queue is persistent then we can even handle a complete outage and restoration of the service.

So if I am a service designer planning for service capacity constraints, then for human clients I would probably choose Option 3, then Option 4, and carefully weigh the costs of Option 2. For BPM processes, where the clients are “machines”, I would prefer Option 4 every time. Why make work for myself handling timeouts?

One problem I see so often is that solution designers go for Option 2 by default – the worst of all the options available to them.

Ian Robinson on Coupling

In my opinion, coupling is the most fundamental attribute of a system architecture and tight coupling is probably the most common architectural problem I see in distributed systems. The manner in which system components interact can be a chief determinant of the scalability and reliability of the final system.

So I really like Ian Robinson’s post on Temporal and Behavioural Coupling, where he uses two coupling dimensions and the inevitable magic quadrant to classify systems based on their degree of temporal and behavioural coupling.

See Ian’s post for the slick professional graphics, but to summarise: event-oriented systems with low coupling occupy the “virtuous” third quadrant of the matrix. Conversely, the brittle “3-tier” applications that many of us struggle with occupy the “evil” first quadrant, where coupling in both dimensions is high.

However, I’m a little miffed to see no mention of my favourite “document-oriented message” in Ian’s diagram. As Bill Poole writes, document messages have lower behavioural coupling than command messages, but more than event messages. So would you put document-oriented messages near the top middle of the matrix, between command-oriented and event-oriented messages? Unfortunately that would break the symmetry. But it also highlights another problem.

Any type of message – document-, command- or event-oriented – could be temporally tightly or loosely coupled. Temporal coupling is more a property of the message transport than of the message type. So I suggest the two coupling dimensions are better characterised as follows:

  • Temporal coupling – characterised by message transport from RPC (tight coupling) through to MOM (loose coupling).
  • Behavioural coupling – characterised by the message type, from command-oriented (tight) through document-oriented to event-oriented (loose), as illustrated in the sketch below.
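For illustration, here are three made-up payloads showing how behavioural coupling loosens as we move from command-oriented through document-oriented to event-oriented messages (the field names are mine, not from any particular system):

```python
# Illustrative payloads only: behavioural coupling from tight to loose.

# Command message: the sender dictates exactly what the receiver must do.
command = {"action": "CreateInvoice", "order_id": 42, "amount": 99.50}

# Document message: the sender passes self-contained data; the receiver
# decides what to do with it.
document = {"order": {"id": 42,
                      "lines": [{"sku": "ABC", "qty": 1}],
                      "total": 99.50}}

# Event message: the sender merely reports that something happened; it neither
# knows nor cares who is listening or how they react.
event = {"event": "OrderPlaced", "order_id": 42,
         "occurred_at": "2009-05-14T10:00:00Z"}

print(command, document, event)
```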

It so happens that distributed 3-tier systems generally employ both command-oriented messages and RPC transports – hence making them inherently “evil”. Events, on the other hand, are naturally virtuous: being asynchronous, they are typically carried over MOM transports (it’s difficult to request an event notification).

Between heaven and hell, it is in the murky mortal realms of SOA where we need to be constantly mindful of the interactions between message type and transport – lest our system ends up in limbo.

56 Architecture Case Studies

The recent brouhaha about Twitter scalability has highlighted the growth of the latest spectator sport in the blogosphere – “armchair architect”. Everyone’s a Monday-morning expert on which language/database/framework is or isn’t the secret to extreme scalability. The real secret is architecture and organizational maturity. Here are some case studies to prove it:

A Conversation with Werner Vogels, in which Werner talks about using service-orientation to scale out the massively distributed services that power the Amazon e-commerce platform. This is one of my favourites because it covers organizational as well as technical aspects of scalability. One of the unique attributes of Amazon is that service-orientation pervades everything – even their organizational structure. Developers are responsible for running their own services. Werner characterises the adoption of services as a challenging and major learning experience, but it has become one of their main strategic advantages. Key lessons learned:

  • service-orientation is an excellent technique to achieve isolation and high levels of ownership and control
  • prohibiting direct database access allows scaling and reliability improvements without affecting clients
  • a single unified service access mechanism supports service aggregation, routing & tracking
  • service orientation improves development and operational processes leading to more agility
  • giving developers operational responsibility enhances the quality of the services.

The eBay Architecture (PDF) covers the evolution of eBay from 1998 to 2006. It’s a great example of how continuous reinvention is needed to keep up with rapidly growing scaling requirements. Frank Sommers writes a good summary and discussion where he argues that organizational capability is just as important as technical architecture for scalability. Another key ingredient of the eBay story is the ability to discard “conventional wisdom” when required. This is covered in an interview with Dan Pritchett, revealing some of the “rules” that eBay bends in order to scale.

  • eBay.com doesn’t use transactions – mainly for scalability and availability reasons
  • different data is treated in different ways – so best effort suffices for some
  • references the CAP theorem – consistency, availability, partition tolerance – pick any two
  • many have arrived at the same idea – and transactions are the first to go

Scalable Web Architectures is a great presentation on scalable web architectures in general and Flickr in particular. Also check out Cal Henderson’s list of his other presentations.

Architectures You’ve Always Wondered About provides slides from QCon London 2009 presentations. Case studies about eBay, Second Life, Yahoo!, LinkedIn and Orbitz.

Avoiding the Fail Whale is a video in which Robert Scoble interviews architects from FriendFeed, Technorati and iLike.

Improving Running Components at Twitter – Evan Weaver describes how Twitter learned to scale by moving to a messaging architecture.

Real Life Architectures at High Scalability – provides a huge collection of pointers to architecture case studies from around the web.