Obviously every distributed system is different and every outage is unique, so it is difficult to generalise. Some takeaways I have are:
- Outages happen to even the best guys on the block…so you better plan for yours.
- Building distributed systems is hard…so you need experience and experienced friends.
- Manual changes are a common cause…not said explicitly in the AWS writeup, but strongly implied.
- Outages are often “emergent” phenomena, whereby a simple error causes many systems to interact in a way that amplifies the original disturbance exponentially. The AWS writeup refers to this as a “storm” and I have witnessed similar “storms” in large distributed systems. The degree of coupling and simple aspects like backoff parameters can make the difference between a disturbance that grows exponentially and one that decays exponentially. Think of the Tacoma Narrows bridge: perhaps the analogy is a stretch, but tuning a few simple parameters can avoid destructive resonances.
- One of the responses pointed to the Netflix Chaos Monkey as being vindicated by the outage. The “Lean” guys have taught us that if something is difficult (like testing or deployment) then you should do it often until it ain't difficult any more. Perhaps system failure/resilience is the next frontier for this approach.
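
To make the point about backoff parameters a little more concrete, here is a minimal Python sketch. It is a toy model, not a description of what AWS actually runs: the function names, the `gain` parameter and the default `base` and `cap` values are all assumptions made up for illustration. `disturbance` treats a "storm" as a quantity that is multiplied by a fixed gain each round (how many new requests each failed request triggers on average), and `backoff_delay` shows one common way clients keep that gain down, namely capped exponential backoff with random jitter.

```python
import random


def backoff_delay(attempt, base=0.1, cap=30.0):
    """Capped exponential backoff with full jitter.

    `base` and `cap` (seconds) are hypothetical defaults for illustration.
    The ceiling doubles with each attempt up to `cap`, and the actual delay
    is drawn uniformly below it so that clients do not retry in lockstep.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)


def disturbance(initial, gain, steps):
    """Toy model of a retry storm (an assumption, not a real system).

    Each round, every outstanding failed request triggers `gain` new
    requests on average. gain > 1 grows exponentially, gain < 1 decays.
    """
    levels = [initial]
    for _ in range(steps):
        levels.append(levels[-1] * gain)
    return levels


if __name__ == "__main__":
    # Aggressive retries: each failure causes more than one new request.
    print(disturbance(100, 1.5, 5))
    # Patient retries: each failure causes fewer than one new request.
    print(disturbance(100, 0.5, 5))
    # Delays a single client might wait on its first few attempts.
    print([round(backoff_delay(n), 3) for n in range(5)])
```

The only difference between the two runs of `disturbance` is one number, which is really the whole point: in this simplified picture the system is stable or unstable depending on whether the effective gain sits just below or just above 1.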