Amazon apologizes for cloud outage, promises to do better

Email LinkedIn
Tools

Amazon has revealed that the lengthy outage at its data center in Northern Virginia was the result of a network configuration error when performing a network capacity upgrade. It caused Amazon's primary and secondary systems there to fail and overload, setting off a chain of events that culminated in a "re-mirroring storm," as nodes searched repeatedly for resources that simply weren't available.  

Amazon was forced to stop and restart systems to recover effected systems, which were locked in a race condition. Moreover, the company had to physically move "a large amount" of additional new server capacity from the U.S. East Region and install them into the degraded cluster. Even incorporating the new capacity was painstaking work in order to ensure that the new capacity doesn't inundated old servers with requests that cannont be fulfilled.

The problem is complex and not easily understood by professionals who are unfamiliar with Amazon's Elastic Compute Cloud; the detailed technical explanation can be found here. In a nutshell: Human error brought the data center into an unexpected state from which it was unable to automatically recover from. Apologizing for the outage, Amazon says it will be giving users a 10-day cloud services credit--even if they were not affected by the outage.

For more on this story:
- check out this article at Datacenter Knowledge
- check out this article at BBC

Related Articles:
Questions raised over latest Amazon EC2 outage
Cloud risks put in spotlight by Amazon outage 
Amazon Web Services now does email