A walk through the InformationWeek archives

The cloud is growing, but cloud outages are nothing new. Neither are we: InformationWeek was founded in 1985, and our online archives date back to 1998. Here are some highlights of the cloud’s worst moments, drawn from those archives.

April 17, 2007 / In Web 2.0 keynote, Jeff Bezos touts Amazon’s on-demand services, by Thomas Claburn — When conference founder Tim O’Reilly asked whether Amazon was making money on this, Bezos replied, “We definitely intend to make money on this,” before finally admitting that AWS was not profitable at the time.

(As a reminder, folks, here in 2022, AWS is now worth a trillion dollars.)

August 12, 2008 / Sorry, the internets are broken today, by Dave Methvin, and Google apologizes for Gmail outage, by Thomas Claburn — After a series of awkward disruptions at Microsoft MSDN, Gmail, Amazon S3, GoToMeeting, and SiteMeter, Methvin laments, “When you use a third-party service, it becomes a black box that’s hard to check; you may not even know if or when something has changed. Welcome to your future nightmare.”

October 17, 2008 / Google Gmail outage brings out opponents of cloud computing, by Thomas Claburn — Given that the outage appears to have lasted more than 24 hours for some, affected paid Gmail customers appear to be owed service credits under the terms of Gmail’s SLA. As one customer put it, “It’s not a temporary issue if it lasts this long. It’s frustrating not being able to expedite these issues.”

June 11, 2010 / The five biggest weaknesses of the cloud, by John Soat — “The recent problems with Twitter (the ‘Fail Whale’) and Steve Jobs’ embarrassment over the network blackout during the introduction of the new iPhone don’t exactly convey warm fuzzy feelings about the internet and network performance in general. An SLA cannot guarantee performance; it can only punish poor performance.”

[In 2022, a cloud SLA can accomplish basically nothing at all. As Richard Pallardy and Carrie Pallardy wrote this week, “Industry standard service level agreements are remarkably restrictive, with most companies assuming little if any liability.”]

April 21, 2011 / Amazon EC2 outage hinders websites, by Thomas Claburn / April 22, 2011 / The cloud takes a hit, Amazon needs to fix EC2, by Charles Babcock / April 29, 2011 / Post-mortem: When Amazon’s cloud turned on itself, by Charles Babcock — Amazon’s “Easter Weekend” outage impacted Engine Yard, Foursquare, Hootsuite, Heroku, Quora, and Reddit, among others. Babcock writes, “By building high availability into cloud software, we have escaped the limitations of hardware failures that used to interrupt running systems. In the cloud, hardware can fail and everything else continues to function. On the other hand, we’ve discovered that we’ve moved into a higher operating altitude, aboard a larger aircraft in which new kinds of failure can occur.

“The new architecture works great when a single disk or server fails, a predictable event when running tens of thousands of devices. But the solution itself doesn’t work if it thinks hundreds of servers or thousands of disks have failed at the same time, taking valuable data with them. That is an unplanned event in cloud architecture because it’s not supposed to happen. It didn’t actually happen last week, either, but the governing cloud software thought it did and sparked a massive recovery effort. That effort in turn froze EBS and the Relational Database Service in place. Server instances continued to run in US East-1, but they couldn’t access anything, more servers couldn’t be launched, and the cloud stopped working in one of its availability zones, for all practical purposes, for more than 12 hours.”

August 9, 2011 / Amazon Cloud Outage: what can we learn? by Charles Babcock — A lightning strike in Dublin, Ireland, took Amazon’s European cloud services offline on Sunday, and some customers’ systems were expected to be unavailable for up to two days. (Lightning will appear again in later outages.)

July 2, 2012 / Amazon outage hits Netflix, Heroku, Pinterest, Instagram, by Charles Babcock — An Amazon Web Services data center in the US East-1 region lost power during severe electrical storms, knocking many customer websites offline.

July 26, 2012 / Google Talk, Twitter, Microsoft Outages: Bad Cloud Day, by Paul McDougall / July 26, 2012 / Microsoft investigates Azure outage in Europe, by Charles Babcock / March 1, 2012 / Microsoft Azure’s explanation is not soothing, by Charles Babcock — Google reported that its instant messaging and video chat service, Google Talk, was down in parts of the United States and around the world; the same day, Twitter was also offline in some areas, and Microsoft’s Azure cloud service was unavailable across Europe. Microsoft’s post-mortem on the Azure cloud outage cites human error as a factor but leaves other questions unanswered. Does this remind you of how Amazon handled its first lightning incident?

October 23, 2012 / Amazon outage: multiple zones, a smart strategy, by Charles Babcock — Traffic at Amazon Web Services’ busiest data center complex, US East-1 in Northern Virginia, was grounded by an outage in one of its Availability Zones. Damage control began immediately, but the effects of the outage were felt throughout the day, said Adam D’Amico, Okta’s director of technical operations. Savvy customers, such as Netflix, that have made a major investment in Amazon’s EC2 can sometimes avoid service disruptions by using multiple zones. But as was widely reported, some regional Netflix services were affected by Monday’s outage.

D’Amico told Babcock that Okta uses all five zones to guard against outages. “If there’s a sixth zone tomorrow, you can bet we’ll be there in a few days.”

January 4, 2013 / Amazon December 24 Outage: A Closer Look, by Charles Babcock — Amazon Web Services once again cites human error, propagated by automated systems, for the loss of load balancing at a key facility on Christmas Eve.

November 15, 2013 / Microsoft pins Azure slowdown on software failure, by Charles Babcock — Mike Neil, General Manager of Microsoft Azure, explains the October 29-30 slowdown and why the failure was so widespread.

May 23, 2014 / Rackspace resolves cloud storage outage, by Charles Babcock — A shortage of SSD capacity disrupted operations for some Cloud Block Storage customers at Rackspace’s Chicago and Dallas data centers. Rackspace’s status reporting service said the issue “was due to higher than expected customer growth.”

July 20, 2014 / Microsoft explains Exchange outage, by Michael Endler — Some customers couldn’t reach Lync for several hours on Monday, and some Exchange users went nine hours Tuesday without email access.

August 15, 2014 / Practice Fusion EHR Caught in Internet Brownout, by Alison Diana — A number of small medical practices and clinics sent patients and staff home after the site of cloud-based electronic health records provider Practice Fusion was caught up in a two-day global outage.

September 26, 2014 / Amazon Restarts Cloud Servers, Xen Bug Blamed, by Charles Babcock — Amazon tells customers it needs to patch and restart 10% of its EC2 cloud servers.

December 22, 2014 / Microsoft Azure outage blamed on bad code, by Charles Babcock — Microsoft’s analysis of the November 18 Azure outage indicates that engineers’ decision to widely deploy misconfigured code triggered a major cloud outage.

January 28, 2015 / When Facebook is down, thousands slow down, by Charles Babcock — When Facebook went down this week, thousands of websites linked to the social media site also went down, according to Dynatrace. At least 7,500 websites that depend on a JavaScript response from a Facebook server had their operations slowed or blocked by the lack of response from Facebook.

August 20, 2015 / Google loses data: Who says lightning never strikes twice? by Charles Babcock — Google experienced high read/write error rates and a small amount of data loss at its Google Compute Engine data center in St. Ghislain, Belgium, from August 13-17, following a thunderstorm that produced four lightning strikes at or near the data center.

September 22, 2015 / Amazon disruption produces cloud downtime spiral, by Charles Babcock — The failure of Amazon DynamoDB early Sunday triggered cascading slowdowns and outages that exemplify the highly connected nature of cloud computing. A number of web companies, including AirBnB, IMDB, Pocket, Netflix, Tinder, and Buffer, were hit by slow service and, in some cases, service disruption. The incident began at 3:00 a.m. PT Sunday, or 6:00 a.m. local time where it had the greatest impact: Amazon’s busiest data center complex in Ashburn, Va., also known as US East-1.

May 12, 2016 / Salesforce Outage: Can Customers Trust the Cloud?, by Jessica Davis — The Salesforce service outage began Tuesday with the company’s NA14 instance, affecting customers on the West Coast of the United States. And although service was restored on Wednesday after nearly a full day of downtime, the instance continued to experience service degradation, according to Salesforce’s online status site.

March 7, 2017 / Is Amazon’s growth a bit out of control? by Charles Babcock — After a five-hour S3 outage in US East-1 on February 28, AWS operations staff said it was harder to restart the S3 indexing system this time than the last time they had tried.

Babcock writes: “Since the outage began with a data entry error, numerous reports of the incident described the event as explicable by simple human error. The human error involved was so predictable and common that this is an inadequate description of what happened. It took only a minor human error for AWS’ operational systems to start working against themselves. It’s the automated nature and runaway path of the failure that is troubling. Automated systems operating in ways that are inevitably doomed to failure are the hallmark of immature architecture.”

Fast forward to today

As Sal Salamone carefully detailed this week in his article on lessons learned from recent major outages, Cloudflare, Fastly, Akamai, Facebook, AWS, Azure, Google, and IBM all experienced similar calamities in 2021-22. Human errors, software bugs, power surges, and automated responses with unintended consequences all wreaked havoc.

What will we be writing about cloud outages in 15 years?

Maybe more of the same. But you may not be able to read it if there is lightning in Virginia.

What to read next:

Lessons learned from recent major outages

Can you recoup losses incurred during a breakdown?

Special Report: How fragile is the cloud really?
