Amazon: Sorry For Netflix Downtime, Here's What We Got Wrong
Amazon has publicly apologized for the outage that stopped Netflix users from spending Christmas Eve slumped in front of How It's Made re-runs while slurping egg nog, blaming human error for the server downtime. According to Amazon, a developer inadvertently deleted part of the "ELB state data" that its Elastic Load Balancing service uses to decide which servers deliver content to each user across different locations, and it took several hours of testing and troubleshooting to figure out what had gone wrong.
"The service disruption began at 12:24 PM PST on December 24th when a portion of the ELB state data was logically deleted. This data is used and maintained by the ELB control plane to manage the configuration of the ELB load balancers in the region (for example tracking all the backend hosts to which traffic should be routed by each load balancer). The data was deleted by a maintenance process that was inadvertently run against the production ELB state data. This process was run by one of a very small number of developers who have access to this production environment. Unfortunately, the developer did not realize the mistake at the time. After this data was deleted, the ELB control plane began experiencing high latency and error rates for API calls to manage ELB load balancers" Amazon
Unfortunately, the initial recovery effort, which tried to restore the ELB state data from snapshots of the system configurations taken prior to the accidental deletion and took several hours, did not work. A second method was cooked up, which proved more successful; however, applying it and bringing all of the systems back online was not as straightforward as simply overwriting the damaged section of data.
Instead, Amazon's AWS team had to merge the new ELB state data with the old – a process which took almost three hours alone – and then spent a further five hours gradually re-enabling all of the service workflows and APIs in a way which did not affect any correctly running processes. Amazon says the system was operating normally by 12:05 PM PST on December 25th.
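Amazon hasn't published how that merge was actually performed; the sketch below only illustrates the general idea of folding a pre-deletion snapshot back into the surviving live state, trusting live records where they still exist. All names here are assumptions for the example:

```python
from copy import deepcopy

def merge_elb_state(live_state: dict, snapshot_state: dict) -> dict:
    """Illustrative merge of surviving live ELB records with a pre-deletion snapshot.

    Records still present in the live state are kept as-is, since they may
    reflect changes made after the snapshot; records that exist only in the
    snapshot are assumed to have been removed by the maintenance process and
    are restored.
    """
    merged = deepcopy(live_state)
    restored = []
    for name, config in snapshot_state.items():
        if name not in merged:
            merged[name] = deepcopy(config)
            restored.append(name)
    print(f"restored {len(restored)} load balancer record(s): {restored}")
    return merged
```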
"Last, but certainly not least, we want to apologize. We know how critical our services are to our customers' businesses, and we know this disruption came at an inopportune time for some of our customers. We will do everything we can to learn from this event and use it to drive further improvement in the ELB service" Amazon
As well as the apology, Amazon says it has implemented new policies to make sure the same problem doesn't happen again. The ELB state data can now only be deleted with specific, per-change approval, rather than under the blanket permissions previously granted to the small number of developers with access, and Amazon has updated its data recovery processes based on the lessons it was forced to learn. "We are confident that we could recover ELB state data in a similar event significantly faster (if necessary) for any future operational event" the company's data team says.
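The postmortem doesn't say how that access control is implemented; as a rough illustration, requiring an explicit approval before any destructive operation against production state could look something like the following, where the approval ticket and function names are invented for the example:

```python
class ApprovalRequired(PermissionError):
    """Raised when a destructive operation lacks a specific change approval."""

def delete_production_elb_state(record_ids, approval_ticket=None):
    """Refuse destructive maintenance on production ELB state unless an
    explicit, per-change approval is attached, instead of relying on the
    standing access a developer already has."""
    if approval_ticket is None or not approval_ticket.get("approved"):
        raise ApprovalRequired(
            "deleting production ELB state data requires a specific "
            "change approval; blanket developer access is not enough"
        )
    for record_id in record_ids:
        print(f"deleting ELB state record {record_id}")

# A maintenance process run without approval now fails loudly:
try:
    delete_production_elb_state(["frontend-lb"])
except ApprovalRequired as err:
    print(err)
```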
In fact, Amazon plans to make some lemonade from the Christmas Eve lemons, building new server systems that can automatically recover data rather than wait for human intervention. "We believe that we can reprogram our ELB control plane workflows to more thoughtfully reconcile the central service data with the current load balancer state" the AWS team suggests. "This would allow the service to recover automatically from logical data loss or corruption without needing manual data restoration."
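Amazon hasn't detailed how that reconciliation would work. One plausible reading, sketched below with invented names, is a control-plane pass that compares the central service data against the state each running load balancer reports about itself and rebuilds any records that have gone missing, so a logical deletion heals without a manual restore:

```python
def reconcile(central_state: dict, reported_states: dict) -> dict:
    """Toy reconciliation pass: rebuild central ELB records from what the
    running load balancers report, rather than waiting for a human to
    restore the data by hand."""
    repaired = dict(central_state)
    for name, reported in reported_states.items():
        if name not in repaired:
            # Record lost (or corrupted) centrally, but the load balancer is
            # still running: reconstruct the record from its reported state.
            repaired[name] = {"backend_hosts": list(reported["backend_hosts"])}
            print(f"recovered central record for {name} from live state")
    return repaired
```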
[via Bloomberg]