Google Apps Cloud Outage Lessons Learned
By Chris Talbot
Google's latest cloud outage caused chaos among Java-based applications on Google App Engine. It's the most recent cloud outage to grab headlines, and not the first for Google.
Google has confirmed that the Java side of Google App Engine suffered an outage on the evening of July 14, 2011, disrupting Java-based applications on the platform for about four and a half hours.
The outage began at 7 pm PT, at which point applications affected by the downtime experienced high latency and error rates. According to Google, approximately 1.9 percent of App Engine traffic was affected at peak. On the Google Developer Blog, the App Engine team noted that the outage began not too long after a scheduled maintenance period, but Google assured developers that the scheduled maintenance and unexpected outage were unrelated.
The service outage "gradually increased in magnitude over time" before Google engineers were dispatched to deal with the problem. Engineers did not begin repairs to the Java App Engine until 9:30 pm PT, 2.5 hours after the outage began, and at first they aimed only to reduce its impact.
The Java element of Google App Engine wasn’t fully back online until 11:30 pm PT, at which point all Java App Engine applications had been restored to normal operations. Google apologized for the outage and promised to look at its procedures to improve performance in the future.
"Overall reliability, quick return to service, and fast, accurate communication to our customers are some of the core goals of Google App Engine's service offering. While we restored service relatively quickly, it's clear to us that we fell short in prompt communication of status updates," posted Wesley Chun, a member of the Google App Engine team, to the blog.
The team is still investigating the causes of the outage, but the blog post noted that it has a preliminary understanding of what went wrong. More information is promised once the investigation has been completed.
This isn’t the first time that Google’s platform for developing cloud applications in its managed data centers has experienced an outage (an unfortunate reality in a business environment where cloud computing service providers are dodging accusations of unreliability).
Here’s a quick (and incomplete) history lesson in Google App Engine’s failures in recent years:
On February 24, 2010, Google App Engine applications experienced degraded operational states for varying amounts of time (from 20 minutes to two hours) between 7:48 am and 10:09 pm PT. The cause? A power failure in the primary data center. Engineers said the scenario had been planned for, but not everyone on staff was aware of the procedures for handling it.
On July 2, 2009, an outage between 6:45 am PT and 12:35 pm PT caused varying degrees of chaos for Google App Engine applications, from partial to complete application outages. The cause was a bug on the GFS Master server that Google stated was triggered by another client in the data center. An improperly formed file handle hadn’t been sanitized on the server side and caused a stack overflow when it was processed. Google later discovered the bug had been live for at least a year.
On June 17, 2008, Google App Engine was hit by a datastore outage at 6:30 am PT. According to Google, only a small number of requests were returned as errors, but the number of errors continued to increase throughout the morning until engineers isolated the incident at 1:40 pm PT. Problem solved, and another bug (this one affecting datastore servers) was found and dealt with.
In other areas of Google cloud computing, the company has often had to deal with surly customers complaining about Gmail or Google Apps outages, but it’s hardly a new tale in the realm of cloud. Google certainly isn’t the only cloud computing service provider that experiences its share of outages, and fingers can easily be pointed at unreliability in a variety of directions.