Lessons From a Cloud Failure: It’s Not Amazon, It’s You

Chaos Monkey will eat your cloud.

Amazon’s cloud-hosted Web Services experienced a catastrophic failure last week, knocking hundreds of sites off the web. Some developers saw the AWS outage as a warning about what happens when we rely too much on the cloud. But the real failure of Amazon’s downtime is not AWS, but the sites that use it.

The problem for those sites that were brought down by the AWS outage is the sites’ own failure to implement the one key design principle of the cloud: Design with failure in mind.

That’s not to say that Amazon didn’t fail rather spectacularly, taking out huge sites like Quora, Reddit, FourSquare and Everyblock, but as Paul Smith of Everyblock admits, while Amazon bears some of the responsibility, Everyblock failed as well:

Frankly, we screwed up. AWS explicitly advises that developers should design a site’s architecture so that it is resilient to occasional failures and outages such as what occurred yesterday, and we did not follow that advice

But perhaps the most instructive lesson comes from those sites that were not affected, notably Netflix, SimpleGeo and SmugMug. Netflix published a look at how it uses AWS last year and, by all appearances, those lessons served the company well, because Netflix remained unaffected by the recent failure.

Among Netflix’s suggestions is to always design for failure: “We’ve sometimes referred to the Netflix software architecture in AWS as our Rambo Architecture. Each system has to be able to succeed, no matter what, even all on its own.”

To ensure that each system can stand on its own, Netflix uses something it calls the Chaos Monkey (no relation). The Chaos Monkey is a set of scripts that run through Netflix’s AWS process and randomly shuts them down to ensure that the rest of the system is able to keep running. Think of it as a system where the parts are greater than the whole.

The photo sharing site SmugMug has also detailed its approach to designing for failure and why SmugMug was largely unaffected by the recent AWS outage. SmugMug co-founder and CEO Don MacAskill, echos Netflix’s redundancy mantra, writing, “Each component (EC2 instance, etc.) should be able to die without affecting the whole system as much as possible. Your product or design may make that hard or impossible to do 100 percent — but I promise large portions of your system can be designed that way.”

MacAskill also has strong words for those who think the recent AWS outage is a good argument for sticking with your own data center: “[SmugMug's] data-center related outages have all been far worse … we’re working hard to get our remaining services out of our control and into Amazon’s.”

“Cloud computing is just a tool,” writes MacAskill, “Some companies, like Netflix and SimpleGeo, likely understand the tool better.”

If you’d like to learn more about how designing for cloud services differs from traditional data-center setups, check out this excellent post on O’Reilly. Also, be sure to read Netflix’s advice and learn from Everyblock’s downtime by following the guidelines in Amazon’s own documentation.

Photo: Technically not a monkey. (DBoy/Flickr/CC)

See Also: