Christmas Eve Netflix outage traced to mistake by Amazon

A Netflix outage that left customers in the United States, Canada and Latin America facing time-outs and delays on Christmas Eve has been traced to a mistake by an engineer at Amazon Web Services (NASDAQ: AMZN). The engineer was running a maintenance process and accidentally directed it at state data used to control the operations of AWS Elastic Load Balancers (ELBs), erasing data in the process.

In this case, the erased data contained instructions on how Netflix video traffic should be directed by the ELBs. Given that many requests for media were still being processed normally, it took the team some time to determine the cause of the problem. Once the problem was identified, the engineers worked through the night to restore the missing ELB state data from an earlier backup.

As reported by The Register, an estimated 6.8 percent of the AWS ELBs were out at the peak of the debacle.  

In the wake of the incident, Amazon has placed restrictions on access to the state data and enhanced its data recovery process. The latter change was prompted by separate bungles that took place as engineers scrambled to get the service back up.

Human error has been the cause of some pretty spectacular outages in the past. A misconfiguration of load-balancing servers at Google (NASDAQ: GOOG) brought down Gmail for 40 minutes in December last year, while every Google search result was wrongly tagged with a "This site may harm your computer" message for an hour in 2009.

For more:
- check out this article at The Register
