March 2, 2017
Less than 48 hours after the failure, AWS has published a detailed analysis of what went wrong. As we'd hoped, the analysis is extremely transparent, direct, and outlines the actions AWS is taking to mitigate the risk of future failure. If you don't have time to read the details, here's the lowdown:
- As we expected (see below), the initiating event was a human error. An authorized admin executed a script to take some parts of S3 offline, but took more capacity offline at once than intended. The user was authorized and the script worked, but it should have had additional safety checks (limits).
- Restarting such a large subsystem took longer than expected. A restart of this magnitude had not been tested recently. Because the affected subsystem is a key part of S3 in that region, the restart delay left the S3 APIs unavailable.
- The AWS Service Health Dashboard admin console could not be updated because it, too, depended on S3 in the affected region.
What It Means:
Our original advice for AWS customers below stands: check your apps for dependence on a single S3 region. In addition,
- Test your operational scripts. Do you have a maintenance script you haven't run for a while? Check it now for limits.
- Isolate your monitoring tools from your production systems. You can't monitor a system from the inside while it's failing.
- Think big in your DR and availability planning. Test a larger failure than usual. Rare events over a long enough period of time…are no longer rare.
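To make the "check your scripts for limits" advice concrete, here is a minimal sketch of a blast-radius guard for a maintenance script. The function and threshold are illustrative assumptions, not anything from the AWS post-mortem:

```python
# Hypothetical sketch: guard an operational script with a blast-radius limit.
# MAX_FRACTION_OFFLINE and the host-list model are illustrative assumptions.

MAX_FRACTION_OFFLINE = 0.10  # never take more than 10% of the fleet offline at once

def hosts_to_remove(requested, fleet):
    """Validate a removal request against a safety limit before acting."""
    limit = int(len(fleet) * MAX_FRACTION_OFFLINE)
    if len(requested) > limit:
        raise ValueError(
            f"Refusing to remove {len(requested)} hosts; limit is {limit} "
            f"({MAX_FRACTION_OFFLINE:.0%} of a {len(fleet)}-host fleet)"
        )
    return [h for h in fleet if h not in set(requested)]

fleet = [f"host-{i}" for i in range(100)]
remaining = hosts_to_remove(["host-1", "host-2"], fleet)
print(len(remaining))  # 98
```

A check like this would not have prevented the operator's typo, but it would have capped its impact, which is exactly the mitigation AWS described.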
OUR ORIGINAL TAKE:
On February 28, 2017, many internet sites and business services suffered disruptions. There was early speculation that it was another hack like the October DDoS attack on Dyn. The reality was less nefarious but nonetheless far-reaching. Amazon’s AWS unit soon revealed a failure in its S3 storage service in a particular region (Eastern US). Just about anyone who uses AWS has a dependency tied to S3. My colleagues (prominently Paul Miller) and I spent the day fielding dozens of inquiries from clients and reporters about the impact of this outage.
Our position can be briefly summarized as, “Wake up, but don’t panic!” The major public cloud provider platforms are incredibly reliable, but they all have failures. This was one of those, it was widely felt, and it was important – but it changes nothing about the viability or future prospects of public cloud.
Here is the longer version:
While the true root cause has not yet been disclosed, this particular incident highlights some aspects of business technology – especially in the cloud – that all companies should understand:
· Technology breaks – get accustomed to that. No tech platform is bulletproof. Cloud services come close because they employ redundancy in their services, but they do indeed fail. Even a 99.99% availability target permits roughly 52 minutes of downtime a year. Don’t count on any layer of your technology stack to be there all the time. Expect it to fail and design accordingly. Then test. Then test again.
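The availability arithmetic above is worth working through once. Here is the calculation for the common "nines" targets, assuming a 365-day year:

```python
# Downtime budget implied by common availability targets (365-day year).
HOURS_PER_YEAR = 365 * 24  # 8760

for target in (0.999, 0.9999, 0.99999):
    downtime_minutes = (1 - target) * HOURS_PER_YEAR * 60
    print(f"{target:.3%} availability allows {downtime_minutes:.1f} minutes of downtime/year")
```

Three nines buys you about 8.8 hours of allowed downtime a year; four nines, about 52 minutes; five nines, about 5 minutes. A multi-hour regional outage blows through all of them, which is why the design advice matters more than the SLA number.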
· AWS remains an exemplar of dependable design. AWS continually proves itself among the most innovative companies in the tech world. Its data centers, hardware, software, security, and overall philosophy are rock-solid. With several trillion data objects under management, the track record of S3 is remarkable. The typical on-premises data center is orders of magnitude more fragile.
· Other cloud services are equally vulnerable. Don’t rush into the arms of Microsoft, Google, or IBM based on this event alone – that’s not a rational response. It’s certainly rational to explore the object storage offerings of AWS’s competitors regularly, though, and we encourage that. In fact, other cloud services also fail and also make the news. Because AWS commands such a large share of the market, its failures tend to make bigger news.
· Dependability is in YOUR hands, not those of your suppliers. The bottom line here is that your providers – regardless of who they are – are not responsible for your business or your storage resiliency strategy. You are! Like any technology stack, you must choose the cloud materials you use, design solid apps around them, assemble them, and maintain the assembly. The assembly is where you attain true resilience and how you deliver dependability to your customers. Note that S3 was down in one region – not all – and many customer apps handled the failure well.
· Tight dependencies are fragile. One main principle of good systems design is to reduce rigid dependencies. Design your code to adapt to such failure, and your code is less likely to fail. Newer software architecture methods and tools allow modern developers to loosen the coupling between software components so failures are less likely to cascade through the dependency chain.
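One way to loosen such a dependency is to wrap a storage read so that a failure in the primary region degrades gracefully to a replica instead of failing hard. This is a minimal sketch under stated assumptions: the function names are illustrative, and the "readers" are stubs standing in for real regional storage clients, not any actual SDK call:

```python
# Hypothetical sketch of loose coupling: degrade to a replica region when
# the primary is unavailable. Names and stubs are illustrative assumptions.

def read_with_fallback(key, primary_read, replica_read):
    """Try the primary region first; on failure, serve from the replica."""
    try:
        return primary_read(key)
    except Exception:
        # Primary region unavailable (e.g., a regional S3 outage).
        # Serving possibly-stale replica data beats failing outright.
        return replica_read(key)

# Stub readers simulating a primary-region outage:
def primary(key):
    raise ConnectionError("primary region unavailable")

def replica(key):
    return f"replica copy of {key}"

print(read_with_fallback("report.csv", primary, replica))
# prints "replica copy of report.csv"
```

The point is the shape, not the code: the caller depends on an abstract "read" operation, not on a single region, so a regional failure stays a local inconvenience rather than cascading through the dependency chain.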
· Human error is your biggest enemy. We don’t yet know precisely why S3 failed, but our best guess is either a software glitch or human error. Most technology failures are the result of smart people doing unfortunate things. Every public cloud provider employs an enormous amount of automation to eliminate human error, but, like storage services, that automation is not foolproof.
· Cloud is not the untamed frontier. Many cloud critics have already pounced on this failure as vindication of their anti-cloud positions. We respectfully disagree. Significant business is already in production in public cloud because these services keep getting better, faster, broader, and yes, safer. And the value keeps expanding, proven by the rapid growth in cloud service spending. Cloud is not only here to stay; it will be the platform for much more of your technology in the future.
Do hold your suppliers accountable for their performance. This includes AWS and other cloud providers. We’ll be staying close to this situation and will share what we know about root cause as soon as we can verify it. For now, assume failures will happen. Build around them. Netflix has demonstrated this principle beautifully at incredible scale with its Chaos Monkey approach. Chaos Monkey roams the Netflix systems and intentionally wreaks havoc (chaos). When things break, Netflix learns, redesigns, and starts breaking things again.
This is not in any way a “blame the user” post. AWS owns the failure and the customer experience problems it caused. AWS customers should keep the pressure on for a root cause description and a strategy to avoid future S3 problems. In parallel, all cloud customers should take this opportunity to check their own apps for any single point of failure.