Any Lessons from the AWS Outage?

You’ve probably heard that AWS had some problems recently. A question on Stack Overflow pointed out a detailed summary of the problem posted on the AWS message board.

Obviously every distributed system is different and every outage is unique, so it is difficult to generalise. Some takeaways I have are:

  1. Outages happen to even the best guys on the block…so you better plan for yours.
  2. Building distributed systems is hard…so you need experience and experienced friends.
  3. Manual changes are a common cause…not said explicitly in the AWS writeup, but strongly implied.
  4. Outages are often “emergent” phenomena, whereby a simple error causes many systems to interact in a way that amplifies the original disturbance. The AWS writeup refers to this as a “storm” and I have witnessed similar “storms” in large distributed systems. The degree of coupling and simple aspects like backoff parameters can make the difference between a disturbance that grows exponentially and one that decays exponentially. Think of the Tacoma Narrows bridge: perhaps the analogy is a stretch, but tuning a few simple parameters can avoid destructive resonances.
  5. One of the responses pointed to the Netflix Chaos Monkey as being vindicated by the outage. The “Lean” guys have taught us that if something is difficult (like testing or deployment) then you should do it often until it ain’t difficult any more. Perhaps system failure/resilience is the next frontier for this approach.
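
To make the point about backoff parameters in item 4 a little more concrete, here is a minimal Python sketch of retrying with capped exponential backoff and random jitter. The function names, the default values for `base` and `cap`, and the choice of "full jitter" (picking a uniformly random delay up to the current ceiling) are my own assumptions for illustration; none of this comes from the AWS writeup.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=30.0):
    """Return a randomised delay (seconds) for the given retry attempt.

    The ceiling doubles with each attempt, starting at `base`, and is
    capped at `cap`. The actual delay is drawn uniformly from
    [0, ceiling] so that many clients do not retry in lockstep.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)


def call_with_retries(operation, max_attempts=5, base=0.1, cap=30.0,
                      sleep=time.sleep):
    """Call `operation`, retrying on failure with jittered backoff.

    Hypothetical helper: real code should catch only the errors it
    knows to be transient rather than every Exception.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            sleep(backoff_delay(attempt, base, cap))
```

The interesting knobs are `base`, `cap` and `max_attempts`. If every client retries immediately and in step with the others, a brief failure can turn into a sustained flood of requests; spreading the retries out in time gives the disturbance a chance to die away instead.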

