Welcome

In the mid-1980s, I was witness to an incident where an upgrade to the SS7 software used in AT&T's long distance network took most of North America's long distance service down hard for more than twenty-four hours. It was then that I began formulating what came to be called Pinkston's Law: MOST OUTAGES BEGIN AS UPGRADES.

Over the years since, I have seen this happen so often that whenever I hear of a major telecom or data service outage, my first thought is, "Must have been an upgrade. Pinkston's Law." In the vast majority of cases, it turns out that's exactly what it was! So, at the urging of those closest to me, I've started this blog to chronicle occurrences of Pinkston's Law whenever I hear of them.

Friday, December 25, 2009

Google's Gmail Outage Caused by Upgrade Error: 09/01/2009


  • Length of outage: Two hours

  • Number of people affected: Unknown - certainly millions

Gmail is a very popular free webmail service that many people use daily. What is less well known is that Gmail is also used extensively in business as a paid, enterprise-grade service.
So, while folks who use the free personal email side of Gmail are annoyed when it goes down, business users are -- understandably -- furious.
News story:
http://www.eweekeurope.co.uk/news/google-s-gmail-outage-caused-by-upgrade-error-1738
While Google does have the admirable motto, "Don't be evil," they are sometimes quite tight-lipped about the specific causes of outages. This time they made the cause clear, in a statement by Ben Treynor, "VP Engineering and Site Reliability Czar":
http://gmailblog.blogspot.com/2009/09/more-on-todays-gmail-issue.html
Here's what happened: This morning (Pacific Time) we took a small fraction of Gmail's servers offline to perform routine upgrades. This isn't in itself a problem — we do this all the time, and Gmail's web interface runs in many locations and just sends traffic to other locations when one is offline.

However, as we now know, we had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers — servers which direct web queries to the appropriate Gmail server for response. At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system "stop sending us traffic, we're too slow!". This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded. As a result, people couldn't access Gmail via the web interface because their requests couldn't be routed to a Gmail server. IMAP/POP access and mail processing continued to work normally because these requests don't use the same routers.

The Gmail engineering team was alerted to the failures within seconds (we take monitoring very seriously). After establishing that the core problem was insufficient available capacity, the team brought a LOT of additional request routers online (flexible capacity is one of the advantages of Google's architecture), distributed the traffic across the request routers, and the Gmail web interface came back online.
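
What Treynor describes is a textbook positive-feedback cascade: each router that declares itself overloaded and stops taking traffic pushes its share onto the survivors, making them that much more likely to tip over too. To make the arithmetic concrete, here is a small Python sketch of my own (this is not Google's code, and the router counts, capacities, and load figures are invented) showing how even a modest overshoot in total load can take down an entire pool of request routers:

# Toy model of the cascade described above (illustrative numbers only).
# A pool of request routers shares the incoming load evenly; any router
# whose share exceeds its capacity drops out, shifting its traffic onto
# the routers that remain.

def simulate_cascade(num_routers, capacity_per_router, total_load):
    """Return how many routers are still serving once the load stops shifting."""
    healthy = num_routers
    while healthy > 0:
        load_per_router = total_load / healthy
        if load_per_router <= capacity_per_router:
            return healthy      # the load fits; the cascade stops here
        healthy -= 1            # one more router overloads and sheds its traffic
    return 0                    # every router overloaded: web interface unreachable

if __name__ == "__main__":
    # 100 routers that can each handle 1.0 unit of load. Taking a few offline
    # for upgrades, plus the underestimated cost of recent changes, pushes the
    # total load to 105 units -- only a 5% overshoot.
    survivors = simulate_cascade(num_routers=100,
                                 capacity_per_router=1.0,
                                 total_load=105.0)
    print("Routers still serving traffic:", survivors)

Run with those numbers, the pool collapses completely: a 5% overshoot is enough, because every router that drops out makes the per-router load on the rest strictly worse. The recovery Treynor describes is the reverse move -- keep adding routers until the per-router load once again fits under capacity.
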
I have to commend Google -- and Mr. Treynor in particular -- for being forthright about the outage, and providing a textbook case of Pinkston's Law. This case also illustrates the tendency for failures in one part of a network to cascade to other parts, often in an unexpected fashion.
