What happens when the cloud goes dark?

I’m sure you’ve either read about this past weekend’s Amazon Web Services outage or visited a site that runs on its infrastructure, such as Quora or Foursquare.

One part of Amazon’s on-demand server product had issues – specifically the Elastic Block Store (EBS) service in one of its availability zones. Many servers use EBS for persistent storage, something the AWS EC2 product doesn’t offer by default. With these volumes being flaky, throwing errors or going offline, many sites were in trouble.

The services that we use the most here at John Carroll, the Simple Storage Service (S3) and the CloudFront content delivery network, were not affected, thankfully, so I could enjoy the holiday weekend. I would have liked to play some online games on my PS3, but as you’ll see below, that too was offline.

So what are some takeaways I see coming out of this outage?

First, don’t put all your eggs in one basket. SmugMug CEO Don MacAskill wrote a very detailed blog post about the Amazon outage and how and why his company’s servers there weren’t affected. He says:

All of our services in AWS are spread across multiple Availability Zones (AZs). We’d use 4 if we could, but one of our AZs is capacity constrained, so we’re mostly spread across three. (I say “one of our” because your “us-east-1b” is likely different from my “us-east-1b” – every customer is assigned to different AZs and the names don’t match up). When one AZ has a hiccup, we simple use the other AZs. Often this is a graceful, but there can be hiccups – there are certainly tradeoffs.
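
To make the idea concrete, here is a minimal Python sketch of "use the other AZs." This is not SmugMug's code; the zone names are placeholders and the health check is a hypothetical stand-in for whatever monitoring you already have.

```python
from typing import Callable, Iterable, Optional


def pick_zone(zones: Iterable[str], is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first zone that passes the health check, or None if none do."""
    for zone in zones:
        if is_healthy(zone):
            return zone
    return None


# Placeholder zone names; as the quote notes, your labels will differ from mine.
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]

# Hypothetical outage: pretend the first zone is down.
down = {"us-east-1a"}

print(pick_zone(zones, lambda z: z not in down))  # prints us-east-1b
```

A real setup would spread traffic across all of the healthy zones rather than picking just one, but the principle is the same: no single zone should be the only place your service can run.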

Second, if you are going to leverage the cloud for services, and you should, you must have a backup plan or set of protocols for what to do if it hits the fan.

For example, if S3 did go down, our WordPress CMS would be affected, as we store user-uploaded assets in S3. To remedy that, we keep a local copy on our server, so our assets stay available to our site visitors. If S3 goes down, we can make a change to a plugin configuration and our assets will still be available. When S3 comes back online, we’d flip the switch and go back to serving things from the cloud.
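
Our actual switch lives in a WordPress plugin setting, but the logic behind it is simple enough to sketch in a few lines of Python. Everything here is illustrative: the bucket name, the local uploads URL, and the `use_cloud` flag are assumptions, not our real configuration.

```python
from urllib.parse import urljoin

# Assumed base URLs for illustration only.
S3_BASE = "https://example-bucket.s3.amazonaws.com/uploads/"
LOCAL_BASE = "https://www.example.edu/wp-content/uploads/"


def asset_url(path: str, use_cloud: bool = True) -> str:
    """Build the public URL for an uploaded asset.

    When use_cloud is False, the same relative path is served from the
    local copy on the web server instead of from S3.
    """
    base = S3_BASE if use_cloud else LOCAL_BASE
    return urljoin(base, path.lstrip("/"))


# Normal operation: assets come from the cloud.
print(asset_url("2011/04/logo.png"))

# S3 is down: flip the switch and serve the local copy.
print(asset_url("2011/04/logo.png", use_cloud=False))
```

This only works because the local copy mirrors the same relative paths as the bucket, so changing one setting is enough to redirect every asset.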

Third, have a communication plan ready and keep users updated during the day.

The only place I found official information on the outage was the AWS Service Health Dashboard, which is fine; that’s where it should be. In addition, many sites (Quora and Reddit come to mind) put up their own pages saying they were being affected by the outage.

If you have a blog, use it. Same goes for Twitter and Facebook. Amazon, even though the info was hidden, was good with updating exactly what was going on and where they were in the process of getting services back online. For example:

Apr 24, 5:05 AM PDT: As detailed in previous updates, the vast majority of affected EBS volumes have been restored by this point, and we are working through a more time-consuming recovery process for remaining volumes. We have made steady progress on this front over the past few hours. If your volume is among those recently recovered, it should be accessible and usable without additional action.

Good information that’s being updated often is important to help keep customers in the loop. Compare that to Sony, whose PlayStation Network has been offline since last Wednesday. Their updates have been nebulous, at best. On April 21, they posted on their official blog:

While we are investigating the cause of the Network outage, we wanted to alert you that it may be a full day or two before we’re able to get the service completely back up and running.

The last update given by the company, on April 23, said this:

We sincerely regret that PlayStation Network and Qriocity services have been suspended, and we are working around the clock to bring them both back online. Our efforts to resolve this matter involve re-building our system to further strengthen our network infrastructure. Though this task is time-consuming, we decided it was worth the time necessary to provide the system with additional security.

We thank you for your patience to date and ask for a little more while we move towards completion of this project. We will continue to give you updates as they become available.

And then, silence. It’s now Monday morning in the US and the service is not online and the current status/ETA for being online hasn’t been updated since Saturday. IGN has more on Sony’s PR response to this outage.

That type of communication wouldn’t work on our campuses. Part of your planning must be a communications plan for who is responsible for keeping a certain audience up to date on the status of services.

My colleagues at Allegheny are doing it right this morning. They had a power outage over the weekend and took to their intranet to update the campus community, on a Sunday.

Am I going to stop using Amazon’s cloud services over this outage? No, definitely not. Is this going to make Amazon improve the service? Yes. Is this a sucky way to do it? Of course.

I’ll be updating this post with feedback from other higher ed web and marketing folks. Andrew Careaga has some interesting thoughts on the outage looking at it through a lens of education.