In the mid-1980s, I witnessed an incident in which an upgrade to the SS7 software in AT&T's long distance network took most of North America's long distance service down hard for more than twenty-four hours. It was then that I began formulating what came to be called Pinkston's Law: MOST OUTAGES BEGIN AS UPGRADES.

Over the years since, I have seen this happen so often that whenever I hear of a major telecom or data service outage, my first thought is, "Must have been an upgrade. Pinkston's Law." In the vast majority of cases, it turns out that's exactly what it was! So, at the urging of those closest to me, I've started this blog to chronicle occurrences of Pinkston's Law whenever I hear of them.

Friday, July 10, 2015

NYSE says 3.5 hour outage caused by software update

On a day filled with stories of hacks and outages, this one seemed to get the most attention. From the NYSE's statement:
On Tuesday evening, the NYSE began the rollout of a software release in preparation for the July 11 industry test of the upcoming SIP timestamp requirement. As is standard NYSE practice, the initial release was deployed on one trading unit. As customers began connecting after 7am on Wednesday morning, there were communication issues between customer gateways and the trading unit with the new release. It was determined that the NYSE and NYSE MKT customer gateways were not loaded with the proper configuration compatible with the new release.
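The failure mode described above — a new release on the trading unit talking to gateways still carrying the old configuration — is a classic version-skew problem. Here's a minimal sketch of the kind of pre-flight compatibility handshake that catches it at connect time; all of the names (`TradingUnit`, `Gateway`, `config_version`) are invented for illustration and are not NYSE's actual systems.

```python
# Hypothetical sketch: refuse gateway sessions whose loaded configuration
# does not match what the trading unit's new release requires.

class IncompatibleConfigError(Exception):
    pass

class TradingUnit:
    def __init__(self, release, required_config_version):
        self.release = release
        self.required_config_version = required_config_version

    def accept(self, gateway):
        # Fail fast at the handshake, rather than producing vague
        # "communication issues" once customers start connecting.
        if gateway.config_version != self.required_config_version:
            raise IncompatibleConfigError(
                f"gateway {gateway.name} has config v{gateway.config_version}; "
                f"release {self.release} requires v{self.required_config_version}"
            )
        return True

class Gateway:
    def __init__(self, name, config_version):
        self.name = name
        self.config_version = config_version

unit = TradingUnit(release="sip-timestamp-prep", required_config_version=2)
ok = Gateway("gw-1", config_version=2)
stale = Gateway("gw-2", config_version=1)

assert unit.accept(ok)
try:
    unit.accept(stale)           # the July 8 scenario: stale config
except IncompatibleConfigError as e:
    print(e)
```

A check like this doesn't prevent the mismatch, but it turns a half-day trading outage into a rejected connection with a readable error.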
The "SIP timestamp requirement" mentioned in the statement is an interesting topic in itself. Bloomberg has more detail on this bit of esoterica here: http://www.bloombergview.com/articles/2015-07-09/market-complexity-broke-the-nyse-before-saving-it.

Source: https://www.nyse.com/market-status/history

Saturday, June 21, 2014

Facebook worldwide outage

Facebook was down worldwide for a half hour, and it was widely suspected to be due to a DDoS attack on the popular social media site. But - as usual - it was Pinkston's Law at work. Here's an official spokes-droid's statement:

We ran into an issue while updating the configuration of one of our software systems. Not long after we made the change, some people started to have trouble accessing Facebook. We quickly spotted and fixed the problem, and in less than 30 minutes Facebook was back to 100% for everyone. This doesn't happen often, but when it does we make sure we learn from the experience so we can make Facebook that much more reliable. Nothing is more important to us than making sure Facebook is there when people need it, and we apologize to anyone who may have had trouble connecting.

Thursday, February 7, 2013

Super Bowl blackout - not caused by Beyonce, but possibly by Pinkston's Law?

There is a lot of finger-pointing down in NOLA-land with regard to the 35-minute power failure during the Super Bowl. So far, it does not look like we can blame it on Beyonce's awesome (if lip-synced) half-time performance, or on her membership in the Illuminati.

Instead, it looks like Pinkston's Law may have had a hand:
A recent electrical system upgrade at the Superdome may have contributed to the blackout during the Super Bowl Sunday, officials say.
Full article is at the UPI website.

Saturday, April 30, 2011

Amazon's Elastic Compute Cloud goes down hard

The bigger they are...

On April 21, 2011, Amazon's Elastic Compute Cloud went down when a planned upgrade was "executed incorrectly":
The goal "was to upgrade the capacity of the primary network," Amazon says. "During the change one of the standard steps is to shift traffic off of one of the redundant routers in the primary EBS [Elastic Block Store] network to allow the upgrade to happen. The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network."

Ultimately, this meant a portion of the storage cluster "did not have a functioning primary or secondary network because traffic was purposely shifted away from the primary network and the secondary network couldn't handle the traffic level it was receiving."
Read the full article at Network World.
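The quoted explanation boils down to a traffic shift executed without checking whether the target network could absorb the load. A capacity guard for that step can be sketched in a few lines; the function name and the Gbps figures below are invented for illustration, not Amazon's actual tooling or numbers.

```python
# Hypothetical sketch of a guard for the "shift traffic off the router"
# runbook step: refuse the shift if the target network lacks headroom.

def shift_traffic(load_gbps, target_capacity_gbps, target_current_gbps=0.0):
    """Allow the shift only if the target network can absorb the load."""
    headroom = target_capacity_gbps - target_current_gbps
    if load_gbps > headroom:
        raise RuntimeError(
            f"refusing shift: {load_gbps} Gbps onto a network with only "
            f"{headroom} Gbps of headroom"
        )
    return True

# Shifting onto the other primary router: plenty of headroom, proceeds.
assert shift_traffic(load_gbps=40, target_capacity_gbps=100)

# Shifting onto the lower-capacity redundant EBS network: refused up
# front, instead of silently overwhelming it.
try:
    shift_traffic(load_gbps=40, target_capacity_gbps=10)
except RuntimeError as e:
    print(e)
```

The point isn't the arithmetic; it's that an upgrade runbook which moves traffic should validate the destination before the move, not discover the shortfall afterward.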

Friday, December 25, 2009

Oregon Employment Division Servers and Phones Crash: 10/04/2009

  • Length of outage: Officially, 10 hours. In reality, 24+ hours

  • Number of people affected: 165,000 unemployment recipients, plus OED staffers.

I happened to be one of those affected by this outage, because at the time, I was drawing unemployment!

The original article in The Oregonian the following Monday spun the story to make it sound as if the extra load of new benefit applicants had crashed the system. Even in this later, edited version, you don't find the truth until well down the page:

Original news story HERE.

Here's where the truth comes out:

Problems started Sunday when a computer server crashed while state workers were doing maintenance on the state's computer network. The 60 percent of unemployed who usually file online for their weekly checks turned to the telephone to file their claims on the state's interactive voice response system. At the same time, the group looking for emergency extensions also were swamping the phone lines.

So, they don't explicitly say it was an upgrade, but the system was down when I tried to use it early on Sunday morning, indicating that they had been working on the system during the overnight shift. This smells suspiciously like an upgrade was being applied. Pinkston's Law!

This outage is also an interesting example of the cascading-failure effect: when people could not file online, they moved to the phones on Monday (so much for the 10-hour outage -- the system was still down Monday morning). The phone system is not sized to handle the traffic the online system carries, so it crashed, too.
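The cascade is easy to model: a failed channel doesn't make its demand disappear, it dumps that demand onto a channel sized for much less. Here's a toy sketch of the dynamic; the 60-percent online share comes from the article above, but the capacity numbers are invented, not OED's actual figures.

```python
# Toy model of the cascade: when the online channel fails, its users
# retry on the phone channel, which is sized for far less traffic.

def route_claims(claimants, online_up, online_capacity, phone_capacity):
    online_share = 0.6  # "60 percent of unemployed usually file online"
    online_load = int(claimants * online_share) if online_up else 0
    # Displaced online filers plus the usual phone filers all hit the IVR.
    phone_load = claimants - online_load
    return {
        "online_overloaded": online_load > online_capacity,
        "phone_overloaded": phone_load > phone_capacity,
    }

# Normal day: both channels hold.
normal = route_claims(165_000, online_up=True,
                      online_capacity=120_000, phone_capacity=70_000)

# Online system down: the phones inherit everything and fail too.
cascade = route_claims(165_000, online_up=False,
                       online_capacity=120_000, phone_capacity=70_000)

print(normal)   # neither channel overloaded
print(cascade)  # phone channel overloaded
```

Under these assumed numbers, the phone system would need to be sized for the full claimant population -- more than double its normal load -- to survive an online outage, which is exactly why the fallback channel collapsed.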