Amazon Autopsy Reveals Causes of Cloud Death
Amazon has apologized to customers affected by last week’s EC2 outage and offered a detailed post mortem about exactly what went wrong. The short answer is that a network update shifted traffic to the wrong router, which then wreaked havoc on an Availability Zone in Amazon’s US East Region.
In addition to apologizing, Amazon is giving affected customers “a 10 day credit equal to 100 percent of their usage of EBS Volumes, EC2 Instances and RDS database instances that were running in the affected Availability Zone.”
Amazon is also promising to improve its communication with customers when things go wrong, but as we pointed out earlier, the real problem is not necessarily Amazon. While Amazon’s services unquestionably failed, sites that had a truly distributed system in place (e.g. Netflix, SmugMug, SimpleGeo) were not affected.
In the end, it depends on how you were using EC2. If you were simply using it as a scalable web hosting service, your site went down. If you were using EC2 as a platform to build your own cloud architecture, then your services did not go down. The latter is a very complex thing to do, and it’s telling that the sites that survived unaffected were all large companies with entire engineering teams dedicated to creating reliable EC2-based systems.
That may be the real lesson of Amazon’s failure: EC2 is no substitute for quality engineers.