In the mid-1980s, I witnessed an incident in which an upgrade to the SS7 software used in AT&T's long distance network took most of North America's long distance service down hard for more than twenty-four hours. It was then that I began formulating what came to be called Pinkston's Law: MOST OUTAGES BEGIN AS UPGRADES.

Over the years since, I have seen this happen so often that whenever I hear of a major telecom or data service outage, my first thought is, "Must have been an upgrade. Pinkston's Law." In the vast majority of cases, that turns out to be exactly what it was! So, at the urging of those closest to me, I've started this blog to chronicle occurrences of Pinkston's Law whenever I hear of them.

Saturday, April 30, 2011

Amazon's Elastic Compute Cloud goes down hard

The bigger they are...

On April 21, 2011, Amazon's Elastic Compute Cloud went down when a planned upgrade was "executed incorrectly":
The goal "was to upgrade the capacity of the primary network," Amazon says. "During the change one of the standard steps is to shift traffic off of one of the redundant routers in the primary EBS [Elastic Block Store] network to allow the upgrade to happen. The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network."

Ultimately, this meant a portion of the storage cluster "did not have a functioning primary or secondary network because traffic was purposely shifted away from the primary network and the secondary network couldn't handle the traffic level it was receiving."
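
To make the failure mode concrete, here is a minimal sketch of the kind of pre-flight check that a traffic shift could run before moving load off a router. It is purely illustrative and is not Amazon's tooling: the Network class, the can_absorb and shift_traffic helpers, and the idea of measuring capacity and load in abstract units are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Network:
    """A hypothetical network path with a fixed capacity (abstract units)."""
    name: str
    capacity: float
    load: float = 0.0

    def headroom(self) -> float:
        # Spare capacity still available on this path.
        return self.capacity - self.load


def can_absorb(target: Network, shifted_load: float) -> bool:
    """Return True if the target has enough headroom for the shifted load."""
    return shifted_load <= target.headroom()


def shift_traffic(source: Network, target: Network, amount: float) -> None:
    """Move `amount` of load from source to target, refusing unsafe shifts."""
    if amount > source.load:
        raise ValueError(f"{source.name} only carries {source.load} units")
    if not can_absorb(target, amount):
        # This is the situation described in the quote above: the
        # destination is too small for the traffic being sent to it.
        raise RuntimeError(
            f"{target.name} has {target.headroom()} units of headroom, "
            f"cannot take {amount}"
        )
    source.load -= amount
    target.load += amount
```

With invented numbers, shifting 40 units from one primary router to a peer with 60 units of headroom succeeds, while shifting the same 40 units onto a secondary network with only 8 units of headroom raises an error instead of saturating it. A check like this would not prevent every bad upgrade, but it turns "executed incorrectly" into a refused command rather than an outage.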
