Soundtrckr Tribe

The cloud giveth…

Posted: April 22nd, 2011 | Filed under: Uncategorized

The following is a guest post by Matthew Cox and Earl Kinney, the team members who work with Amazon’s cloud to deliver performance to the Soundtrckr services.

If you are an active user and only want to know the result of the outage:

  • Any accounts or stations created after 1AM GMT on April 21st will be missing.
  • If your account was created in this window and no longer works, please simply recreate it.
  • The same applies to missing stations.

Update: April 25, 2011

We have heard from AWS that at least one of our volumes cannot be recovered. Unfortunately, this means we won’t be able to make an accurate assessment of the number of accounts or stations that were lost when we recovered the service from a backup.

The remainder of this post is a more in-depth discussion of the outage and our future plans.

Over the last twenty-four hours we have experienced what was essentially a worst-case scenario for infrastructure failure as it relates to providing our services. At this point there has been much coverage of Amazon’s EC2 failure and its wide-reaching effect on many sites and services. Without a doubt, there will be much more as Amazon recovers and delivers its own postmortem.

As with any small team, we constantly face a balancing act between developing new features and continuing to enhance the infrastructure as needed to support growth and changing performance profiles. Although we have instances running in different availability zones in Amazon’s US East region, the failure of EBS in that entire region took us down.

As the various zones were returned to service, the one zone that continued to have problems was the one that held the most important portion of the Soundtrckr infrastructure: our database instance, its data volumes and backups.

While we have the ability to deploy additional mobile or web instances quickly, redeploying the database infrastructure is more of a time investment. We were initially confident that Amazon would be able to restore services and we could avoid having to deploy an entirely new database instance and incurring the small data loss that would accompany such a switch.

As it became clear that the outage was going to go well beyond 24 hours with no firm timetable for complete restoration of services, we prepared a new instance and redeployed this morning using the most recent database snapshots taken before the outage. Unfortunately, this did result in a small amount of data loss. We won’t know exactly how much until Amazon completely restores service. We do know that any accounts or stations created after 1AM GMT on April 21st will be missing.

 

We were pleased that previous preparation enabled us to stand up a basic downtime notification for the web services. We learned that all mobile clients need enhancement to deal with such extreme cases of service failure. The timeline for deploying the client enhancements will vary from platform to platform.
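For readers curious about what handling "extreme cases of service failure" on the client side might look like, here is a minimal sketch in Python. It is not Soundtrckr's actual client code: the names `ServiceUnavailable` and `fetch_with_fallback`, the retry count, and the delays are all hypothetical. The idea is simply to retry a failing request a few times with growing pauses, then fall back to a friendly downtime notice instead of an error.

```python
import time


class ServiceUnavailable(Exception):
    """Hypothetical error raised when the backend cannot be reached."""


def fetch_with_fallback(fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); retry with exponential backoff, then return a downtime notice.

    fetch, retries, base_delay, and sleep are illustrative parameters,
    not part of any real Soundtrckr API.
    """
    for attempt in range(retries):
        try:
            return {"status": "ok", "data": fetch()}
        except ServiceUnavailable:
            # Wait 1x, 2x, 4x ... the base delay between attempts,
            # but do not wait after the final attempt.
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)
    return {
        "status": "down",
        "message": "Service is temporarily unavailable. Please try again later.",
    }


def always_down():
    """Stand-in for a request made while the service is offline."""
    raise ServiceUnavailable()


delays = []
result = fetch_with_fallback(always_down, sleep=delays.append)
print(result["status"], delays)
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without actually waiting; a real client would more likely schedule the retry on its platform's own timer.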

We have already made some tweaks to the backup system in place for the database infrastructure. These changes decrease the time investment needed to swap out instances and narrow the window for potential data loss. We will also accelerate our plans for live replication of database content to multiple Amazon EC2 regions.
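To illustrate why backup frequency narrows the window for potential data loss, here is a small sketch in Python. The schedules and timestamps are made up for illustration and do not describe our actual backup configuration; `snapshot_schedule` and `data_loss_window` are hypothetical helper names. Anything written after the last usable snapshot is lost on a restore, so the worst case is roughly one snapshot interval.

```python
from datetime import datetime, timedelta


def snapshot_schedule(start, interval, count):
    """Return `count` snapshot times spaced `interval` apart, beginning at `start`."""
    return [start + i * interval for i in range(count)]


def data_loss_window(snapshot_times, failure_time):
    """Return the gap between the last snapshot before the failure and the failure."""
    usable = [t for t in snapshot_times if t <= failure_time]
    if not usable:
        raise ValueError("no snapshot predates the failure")
    return failure_time - max(usable)


# Hypothetical values chosen only to show the effect of the interval.
start = datetime(2011, 1, 1, 0, 0)
failure = datetime(2011, 1, 1, 23, 30)

daily = snapshot_schedule(start, timedelta(hours=24), 2)
hourly = snapshot_schedule(start, timedelta(hours=1), 48)

print(data_loss_window(daily, failure))   # 23:30:00
print(data_loss_window(hourly, failure))  # 0:30:00
```

Live replication, which we mention above as a longer-term plan, aims to shrink this window much further than periodic snapshots can.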

Again, we appreciate your patience. We will continue to push development of Soundtrckr’s infrastructure.