Human error at Amazon takes down Netflix

A Netflix outage on Christmas Eve was traced to a mistake by an Amazon (NASDAQ: AMZN) Web Services engineer. The engineer is one of a very small number of developers with access to the production environment, and the failure to notice the mistake at the time lengthened the team's effort to identify and rectify the problem.

To get a better understanding of what happened, you can read our outline of the outage in "Christmas Eve Netflix outage traced to mistake by Amazon" or the detailed and candid blog post by a member of the AWS team here.

This is not the first time that human error has had such widespread repercussions. That it took place at AWS simply meant that the effect was felt by companies that rely heavily on its cloud infrastructure. In this case, the mistake culminated in technical problems for Netflix, which has built its video-serving infrastructure on top of AWS.

Mind you, the fact that Netflix adopts a multiple-zone strategy did not shield it from the Christmas Eve error. Clearly, while economies of scale mean that Amazon can design and implement an infrastructure that stands above what a typical business is capable of, its sheer scale also introduces complexities, and the whole can still be undone by a single human error.

GigaOM put it like this: "More shocking than the AWS outage, though … is that the Christmas Eve outage actually took down Netflix, which is often cited as the most-advanced AWS user around." Indeed, Netflix is known to have a host of homegrown tools built specifically to monitor and manage its AWS-based infrastructure.

Has human error ever caused issues in your business, despite the best-laid IT plans? If so, do drop me an email, send me a tweet, or leave a note in the comment section below. - Paul Mah (Twitter @paulmah)