In the mid-1980s, I was witness to an incident where an upgrade to the SS7 software used in AT&T's long distance network took most of North America's long distance service down hard for more than twenty-four hours. It was then that I began formulating what came to be called Pinkston's Law: MOST OUTAGES BEGIN AS UPGRADES

Over the years since, I have seen this happen so often that whenever I hear of a major telecom or data service outage, my first thought is, "Must have been an upgrade. Pinkston's Law." In the vast majority of cases, it turns out that's exactly what it was! So, at the urging of those closest to me, I've started this blog to chronicle occurrences of Pinkston's Law whenever I hear of them.

Friday, December 25, 2009

Oregon Employment Division Servers and Phones Crash: 10/04/2009

  • Length of outage: Officially, 10 hours. In reality, 24+ hours

  • Number of people affected: 165,000 unemployment recipients, plus OED staffers.

I happened to be one of those affected by this outage, because at the time, I was drawing unemployment!

The original article in the Oregonian the following Monday spun the story to make it sound as if it were the extra load of new people applying for benefits that crashed the system. Even in this later, edited version, you don't find the truth until well down the page:

Original news story HERE.

Here's where the truth comes out:

Problems started Sunday when a computer server crashed while state workers were doing maintenance on the state's computer network. The 60 percent of unemployed who usually file online for their weekly checks turned to the telephone to file their claims on the state's interactive voice response system. At the same time, the group looking for emergency extensions also were swamping the phone lines.

So, they don't explicitly say it was an upgrade, but the system was down when I tried to use it early on Sunday morning, indicating that they had been working on the system during the overnight shift. This smells suspiciously like an upgrade was being applied. Pinkston's Law!

Also, it is an interesting example of the cascading failure effect; when people could not file online, they moved to the phones to file on Monday (so much for the 10-hour outage -- the system was still down Monday morning). The phone system is not sized to handle all of the traffic that the online system handles, so it crashed, too.

Google's Gmail Outage Caused by Upgrade Error: 09/01/2009

  • Length of outage: Two hours

  • Number of people affected: Unknown - certainly millions
Gmail is a very popular free webmail service that many people use daily. What is less well known is that Gmail is also used extensively in business as a paid, enterprise-grade service.
So, while folks who use the free personal email side of Gmail are annoyed when it goes down, business users are -- understandably -- furious.
News story:
While Google does have the admirable mission statement, "Don't be evil," they are sometimes quite tight-lipped about specific causes of outages. This time they made it clear, in a statement by Ben Treynor, "VP Engineering and Site Reliability Czar":
Here's what happened: This morning (Pacific Time) we took a small fraction of Gmail's servers offline to perform routine upgrades. This isn't in itself a problem — we do this all the time, and Gmail's web interface runs in many locations and just sends traffic to other locations when one is offline.

However, as we now know, we had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers — servers which direct web queries to the appropriate Gmail server for response. At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system "stop sending us traffic, we're too slow!". This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded. As a result, people couldn't access Gmail via the web interface because their requests couldn't be routed to a Gmail server. IMAP/POP access and mail processing continued to work normally because these requests don't use the same routers.

The Gmail engineering team was alerted to the failures within seconds (we take monitoring very seriously). After establishing that the core problem was insufficient available capacity, the team brought a LOT of additional request routers online (flexible capacity is one of the advantages of Google's architecture), distributed the traffic across the request routers, and the Gmail web interface came back online.
I have to commend Google -- and Mr. Treynor in particular -- for being forthright about the outage, and providing a textbook case of Pinkston's Law. This case also illustrates the tendency for failures in one part of a network to cascade to other parts, often in an unexpected fashion.
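The cascade Treynor describes follows a simple dynamic: every router that declares itself overloaded dumps its share of the traffic onto the survivors, making each of them more likely to fail in turn. Here is a minimal toy model of that feedback loop in Python. The numbers and the function are purely illustrative, not anything from Google's systems; the point is just how sharply the outcome flips from "everything fine" to "total outage" once per-router load crosses capacity.

```python
# Toy model of a load-redistribution cascade (illustrative only, not
# Google's actual architecture). Routers that exceed their capacity stop
# accepting traffic, and their load shifts onto the remaining routers.

def simulate_cascade(num_routers, capacity_per_router, total_load):
    """Return how many routers are still up once the cascade settles."""
    up = num_routers
    while up > 0:
        load_each = total_load / up
        if load_each <= capacity_per_router:
            return up      # load is sustainable; the cascade stops here
        up -= 1            # one more router cries "stop sending traffic!"
    return 0               # no router left standing: total outage

# Hypothetical numbers: 7 routers, each able to handle 100 units of load.
print(simulate_cascade(7, 100, 690))   # under capacity: all 7 survive
print(simulate_cascade(7, 100, 750))   # just over capacity: all 7 fall
```

Note that the difference between the two runs is less than 10 percent of total load, yet one case is business as usual and the other is a complete outage. That knife-edge behavior is why "slightly underestimated the load" was enough to take down nearly every request router within minutes.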

Tesco IT upgrade causes till outage: May 11, 2009

  • Length of outage: 4-24 hours

  • Number of people affected: 100 retail stores forced to close
Tesco is a major grocery and general merchandise retailer in the UK; North American readers might compare it to Wal-Mart or Costco. Tesco launched a big "loyalty scheme" promotion in UK newspapers to its millions of Clubcard holders. The promotion required a software upgrade, which caused their tills (cash registers) to malfunction just as the stores opened at 8:00 AM.
Original news story is HERE.
The official statement from Tesco was terse but candid:
"A number of stores were affected by a routine IT upgrade this morning at various locations in the country," said a Tesco spokesperson.
She might just as well have said, "Blimey! We were struck down by Pinkston's Law!"

Florida Keys Electric Cooperative Power Outage: Oct. 11, 2004

  • Length of outage: Approx. 1 hour

  • Number of people affected: Unknown -- most residents of Florida Keys
Here is a relatively rare instance of a hardware upgrade causing an outage. The unique geography and climate of the Florida Keys were clearly a factor.
Read the original article HERE.
Here, I think it best to quote directly from the article to give you a sense of what happened:
One strand of a corroded shield wire unraveled during its removal from service today, causing a power outage from Islamorada to Key West. Florida Keys Electric Cooperative was pulling the wire for replacement when one of its seven twisted strands failed.

The broken strand swung into the energized transmission lines below it, causing a short in the transmission line. The shorted line caused a power outage beginning at 12:40 p.m. The outage began south of Snake Creek Bridge at mile marker 86.

The strand of shield wire failed over water while being pulled along Long Key Channel, complicating correction of the problem.
As a little background, the "shield wire" is the uninsulated wire that runs from pole to pole above the wires that carry the actual current. It is intended to reduce service interruptions and equipment damage by intercepting lightning strikes. In a salt-air environment such as one finds along coastlines, these conductors tend to corrode fairly quickly.

PayPal Upgrade Causes Major Outage, Affects Debit-Card Users: Oct 8, 2004

  • Length of outage: At least 4 days

  • Number of people affected: Unknown, but clearly many hundreds of thousands.
PayPal has become such a major part of our lives for online commerce that we often think of it as something that is just "always there to use" like ATMs. But, of course, it runs on a complex network of servers and other equipment, and with those come upgrades.
Read the original article HERE.
Here's the official word from PayPal (I find it interesting that companies in this situation invariably send out a female staffer to read the official statement to the press. Perhaps they reason that it puts a sweeter face on their weasel carefully-chosen words?) :
PayPal spokesperson Amanda Pires said in addition to the new home page, PayPal "added some features on the backend" on Friday that were the cause of the problem. Pires said, "Everyone is working fast and furiously to get it all fixed." The problems are intermittent, she said, but declined to describe their nature or reveal the features that were added on Friday.
PayPal is owned by eBay, and it now has little in the way of competition to keep it on its toes. As something of an "insider" in one of my jobs, I witnessed some PayPal outages and service degradations that were never publicly acknowledged, so I will not cover them here.

Newly Installed Software Causes Outages in MIT's 411 Directory Services: Feb, 1998

  • Length of outage: Several outages, up to four weeks

  • Number of people affected: Unknown. All MIT campus phone services affected
This older article chronicles the problems MIT was having with Bell Atlantic's 411 (directory assistance) services in 1997-1998. Apparently there had been a number of failures leading up to the major one in February 1998.
Read the original article HERE.
Here's a statement from MIT's point of view:
"This was caused by a software change. Since the new software did not interface with ours, we had to reroute traffic," said Valerie L. Hartt, Supervisor of Operator Services in Information Systems.
It seems that Bell Atlantic would periodically perform upgrades on their own equipment which would render it incompatible with the calls they were receiving from MIT's system.
"Part of the problem with this was that Bell Atlantic never informed MIT's 5ESS service team that it would be performing this [upgrade] service ... Therefore, we could not inform the community, nor be available during the upgrade to perform our own testing."
For me, the funniest part of this outage is the fact that both Bell Atlantic and MIT were using identical telephone switches: the AT&T 5ESS, which is still in widespread use.

Batelco (Bahrain) Cellular Network Outage: May 19-20, 2007

  • Length of outage: Unknown

  • Number of people affected: Unknown, up to 600,000 possible
Link to the original news story, which quotes the Gulf Daily News:
The outage caused a bit of outrage:
An influential business source told the newspaper that "a company with nearly BD100 million net profit should have a back-up service because what happened affected the communications of thousands of mobile owners. This is not acceptable nowadays," he said.
The outage was blamed on "migration to a New Generation Network (NGN)." I wonder how one says Pinkston's Law in the local language...

Blackberry E-mail Outage: 02/11/2008

  • Length of outage: 3 hours

  • Number of people affected: Unknown, North American users of Blackberry's email
Original Story: http://www.cnbc.com/id/23134603
This outage should probably count as at least two examples of Pinkston's Law, based on this quote:
It was the second major outage for the service in less than a year. In April, a minor software upgrade crashed the system for all users. A smaller disruption in September also was caused by a software glitch.
I find it interesting that at least one analyst zeroed in on the existence of a Network Operations Center (NOC) as a contributing factor in the outage:
Any time you got a system that's got a NOC, a Network Operations Center, you have the potential for a single point of failure. What's a bit surprising to me is that with all the work they've been doing over time ... that they haven't been able to have enough redundancy in the NOC so that there isn't a single point of failure.