5/11 Amazon’s Cloud Disaster: What Happened and What Does it Mean?

Last week, Amazon’s cloud service suffered a major disaster. Several businesses saw their networks and web sites go down for hours or even days. Amazon brought customers back up in phases but had not really explained what happened... until now.

In a lengthy statement, Amazon has now described in deep technical detail what happened. Translated for those of us without technical backgrounds, the essence of the explanation is below.

Amazon had multiple disk failures in its datacenter. When its engineers tried to fix these problems using the built-in programs for managing and repairing such issues, their efforts ended up making things worse. That’s because the parts of the system are tightly integrated and depend on one another. The surge of data input and output triggered by the original disk failures was so extreme that it stopped traffic and shut down the datacenter. Think of it as a domino-effect disaster scenario.

To put this in terms that more business people may recognize from experience with their own in-house networks: Amazon’s drive array crashed and, while it was rebuilding, system resources were maxed out to the point that accessing the system became impossible.

That, in a nutshell, is what happened at Amazon’s cloud datacenter.

But wait, aren’t cloud solutions supposed to prevent catastrophes like Amazon’s? They are often marketed and hyped as being immune to these kinds of disasters. Although it is true that cloud architectures tolerate failures from many more sources than traditional networks do, no system is immune to failure. Therein lies the moral of the story for businesses, small and large.

Always plan for failure, and build in redundancy wherever you can. Amazon’s outage illustrates that lesson well. Some of Amazon’s customers had followed this advice and did not suffer outages or losses when the Amazon cloud solution crashed. Others, who didn’t plan, were not so lucky, and some lost business data permanently.
