Amazon Web Services hit by power outage
Amazon Web Services suffered a power outage at its Northern Virginia data center on Thursday evening last week, briefly bringing down one availability zone in the US-EAST-1 region. According to InformationWeek, the companies affected included Heroku, Pinterest and Quora.
Though control over 99 percent of affected systems was returned to customers, it was at least seven hours before some services, such as Heroku, announced that access had been fully restored.
According to a root-cause analysis of the outage, the problem originated with a cable fault in a high-voltage utility power distribution system. Two utility substations went offline, causing backup generators to take over. Unfortunately, one of the generators overheated due to a defective cooling fan, causing the EC2 instances and EBS volumes that relied on this generator to fall back on backup power from a separate power network. An improperly configured circuit breaker then tripped, leaving the affected EC2 instances and EBS volumes "without primary, back-up or secondary back-up power."
It must be noted that the outage was comparatively minor given the number of data centers and availability zones Amazon operates around the world. While Amazon (NASDAQ: AMZN) did suffer a failure that spanned more than one availability zone once last year, the takeaway here is that companies that truly value uptime must architect their systems across availability zones or even across geographical regions.
Indeed, Amazon once again urged businesses to implement their solutions using more than one availability zone for greater fault resilience. The company noted: "Those customers with affected instances or volumes that were running in multi-Availability Zone configurations avoided meaningful disruption to their applications."
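
To make the multi-zone advice concrete, here is a minimal sketch in Python of the underlying idea: spread instances evenly across zones so that the loss of any single zone leaves the rest running. It does not call any AWS API. The zone names, the instance IDs and the helper functions place_instances and surviving_instances are all hypothetical, chosen only for illustration.

```python
from collections import Counter
from itertools import cycle


def place_instances(instance_ids, zones):
    """Assign each instance to a zone in round-robin order (hypothetical helper)."""
    zone_cycle = cycle(zones)
    return {instance_id: next(zone_cycle) for instance_id in instance_ids}


def surviving_instances(placement, failed_zone):
    """Return the instances that are not located in the failed zone."""
    return [inst for inst, zone in placement.items() if zone != failed_zone]


# Hypothetical zone names and instance IDs, for illustration only.
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
instances = [f"web-{n}" for n in range(6)]

placement = place_instances(instances, zones)
per_zone = Counter(placement.values())

# Simulate the loss of each zone in turn and count what is left running.
for zone in zones:
    survivors = surviving_instances(placement, zone)
    print(f"{zone} down: {len(survivors)} of {len(instances)} instances still up")
```

With six instances across three zones, any single-zone failure takes out two instances and leaves four. A real deployment would also need load balancing and data replication across zones, which this sketch leaves out.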