A week after Amazon Web Services suffered an outage of its Elastic Compute Cloud (EC2), Amazon finally came forward with an explanation of what happened, an apology to the many companies that were affected, and information about what it plans to do to prevent such events in the future.
The apology and explanation included news that Amazon would provide a 10-day credit to customers affected by the outage.
Amazon told customers that the trigger for the event was an incorrectly implemented network configuration change that led to a re-mirroring storm. Amazon said that it now understands how much capacity large recovery events require, and that this capacity was not in place before the outage.
"We have already increased our capacity buffer significantly, and expect to have the requisite new capacity in place in a few weeks," the company said in the message posted to customers. "We will also modify our retry logic in the EBS server nodes to prevent a cluster from getting into a re-mirroring storm."
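Amazon's message does not spell out how the retry logic will change. As a rough illustration only, one common way to keep many nodes from retrying at the same moment is exponential backoff with random jitter. The Python sketch below assumes that approach; the function name and parameters are hypothetical and are not taken from Amazon's systems.

```python
import random


def backoff_delays(max_attempts, base=1.0, cap=60.0, rng=None):
    """Return the wait in seconds before each retry attempt.

    Hypothetical helper for illustration; not Amazon's implementation.
    """
    if rng is None:
        rng = random.Random()
    delays = []
    for attempt in range(max_attempts):
        # The upper bound doubles with each attempt, but never exceeds the cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Picking a random wait below the bound spreads retries from
        # different nodes apart instead of letting them fire together.
        delays.append(rng.uniform(0, ceiling))
    return delays
```

With the defaults, a node would wait up to 1 second before its first retry, up to 2 seconds before the second, and so on, never more than 60 seconds, so a crowd of nodes that fail together does not keep hitting the cluster in lockstep.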
Amazon also offered details on how it would improve its recovery process to speed up the fixing of any future outages.
"We want to apologize," the message said. "We know how critical our services are to our customers’ businesses and we will do everything we can to learn from this event and use it to drive improvement across our services. As with any significant operational issue, we will spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make changes to improve our services and processes."