Google Apps Cloud Outage Lessons Learned

Google has confirmed that a Java App Engine outage on the evening of July 14,
2011, disrupted Java-based applications on Google App Engine for about four and
a half hours.

The outage began at 7 pm PT, at which point affected applications experienced
high latency and error rates. According to Google, approximately 1.9 percent of
App Engine traffic was affected at peak. On the Google Developer Blog, the App
Engine team noted that the outage began shortly after a scheduled maintenance
period, but Google assured developers that the scheduled maintenance and the
unexpected outage were unrelated.

The service outage “gradually increased in magnitude over time” before Google
engineers were dispatched to deal with the problem. Engineers did not begin
repairs to the Java App Engine until 9:30 pm PT, two and a half hours after the
outage began, and their first aim was to reduce its impact.

The Java element of Google App Engine wasn’t fully back online until 11:30 pm PT,
at which point all Java App Engine applications had been restored to normal
operations. Google apologized for the outage and promised to look at its
procedures to improve performance in the future.

“Overall reliability, quick return to service, and fast, accurate communication to our
customers are some of the core goals of Google App Engine’s service offering.
While we restored service relatively quickly, it’s clear to us that we fell
short in prompt communication of status updates,” posted Wesley Chun, a member
of the Google App Engine team, to the blog.

Currently, the team is still investigating the causes of the outage, but the
blog post noted that it has a preliminary understanding of what caused the Java
outage. More information is promised once the investigation has been completed.

This isn’t the first time that Google’s platform for developing cloud applications
in its managed data centers has experienced an outage (an unfortunate reality
in a business environment where cloud computing service providers are dodging
accusations of unreliability).

Here’s a quick (and incomplete) history of Google App Engine’s failures in
recent years:

On February 24, 2010, Google App Engine applications ran in degraded states for
varying amounts of time (from 20 minutes to two hours) between 7:48 am and
10:09 am PT. The cause? A power failure in the primary data center. Engineers
said the scenario had been planned for, but not everyone on staff was familiar
with the procedures.

On July 2, 2009, an outage between 6:45 am PT and 12:35 pm PT disrupted Google
App Engine applications to varying degrees, from partial to complete
application outages. The cause was a bug on the GFS Master server that Google
stated was triggered by another client in the data center. An improperly formed
file handle hadn’t been sanitized on the server side and caused a stack
overflow when it was processed. Google later discovered the bug had been live
for at least a year.

On June 17, 2008, Google App Engine was hit by a datastore outage at 6:30 am PT.
According to Google, only a small number of requests were returned as errors,
but the number of errors continued to increase throughout the morning until
engineers isolated the incident at 1:40 pm PT. Problem solved, and another bug
(this one affecting datastore servers) was found and dealt with.

In other areas of Google cloud computing, the company has often had to deal
with surly customers complaining about Gmail or Google Apps outages, but that
is hardly a new tale in the realm of cloud. Google certainly isn’t the only
cloud computing service provider that experiences its share of outages, and
fingers can easily be pointed at unreliability in a variety of directions.

