At about 1 AM Pacific time on Thursday the 21st of April, Amazon's AWS cloud computing service started having performance issues that escalated into an outage that has taken down many major sites, including reddit, Quora, Foursquare and Hootsuite, to mention a few. At City Network we use Co-Tweet as our preferred Twitter app. Co-Tweet was also down for more than a day as Amazon's problems continued. Quora is still, three days after the problems started, working on recovering from the downtime, as you can see from the image to the right.
While you could have designed your systems to run over several AWS regions (adding plenty of complexity), even those who followed best practice and designed over more than one so-called availability zone still had issues. This points to a possible single point of failure spanning two or more zones. Designing over several regions (data centers) often poses a whole set of other problems, as you have to deal with the latency of going over the public Internet.
Two days after the start of the failure, this could be read on reddit's site:
“reddit is in "emergency mode" right now because Amazon is experiencing a degradation. they are working on it but we are still waiting for them to get to our volumes. there is no ETA at this time, but we are trying to work some magic and will very slowly be bringing the site back up. please stand by.”
With one region and multiple availability zones affected in the world's leading cloud computing service, what is one to do? Amazon has, after all, probably spent more money than anyone on building a highly redundant service. For true redundancy you need more.
This is where a solution like Cloud Foundry, VMware's newly launched open initiative, is an extremely interesting approach. To reach the ultimate redundancy you would build out and scale not only over several regions (if provided by your cloud computing provider) but also over multiple providers. Until now this has been extremely costly and has added a lot of complexity to any solution. However, Cloud Foundry is onto something, and City Cloud will soon launch images of Cloud Foundry.
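To make the idea of redundancy over both regions and providers a little more concrete, here is a minimal sketch of the selection logic a client or load balancer might use. It is only an illustration under stated assumptions: the provider names, the example.com URLs and the `pick_endpoint` helper are all hypothetical, and the health check is passed in as a function so that nothing here depends on a particular provider's API.

```python
from typing import Callable, Optional, Sequence, Tuple

Endpoint = Tuple[str, str]  # (label, base URL)

# Hypothetical endpoints: two regions at one provider, one region at another.
# The labels and URLs are placeholders, not real services.
ENDPOINTS: Sequence[Endpoint] = [
    ("provider-a/region-1", "https://a-region1.example.com"),
    ("provider-a/region-2", "https://a-region2.example.com"),
    ("provider-b/region-1", "https://b-region1.example.com"),
]


def pick_endpoint(
    endpoints: Sequence[Endpoint],
    is_healthy: Callable[[str], bool],
) -> Optional[Endpoint]:
    """Return the first endpoint whose health check passes, or None."""
    for label, url in endpoints:
        try:
            if is_healthy(url):
                return label, url
        except Exception:
            # A health check that raises is treated as an unhealthy endpoint.
            continue
    return None


if __name__ == "__main__":
    # Simulate an outage that takes out every region of provider A.
    down = {"https://a-region1.example.com", "https://a-region2.example.com"}
    print(pick_endpoint(ENDPOINTS, lambda url: url not in down))
```

In a real deployment the health check would be an actual request with a timeout, and the hard part is keeping data in sync between providers, which this sketch does not address.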
There is no doubt that cloud computing is here to stay. We are not only speaking of services that are as reliable as, or better than, in-house services; cloud computing also creates a whole new dynamic in the companies that use it. Yes, there are great financial benefits, but the flexibility of cloud computing dramatically improves innovation in companies of just about any size, which is obviously more important than driving down cost. Clearly, Amazon's issues highlight the need for redundancy over both data centers and providers. With City Cloud we will strive to deliver this to our customers, with multiple regions as well as the right PaaS services to make it truly viable thanks to less complexity. We are all about openness, and we realize that we need to earn your business every day, because in the future of the open cloud it is, or will be, easy to move between providers. We want to support this and make sure the open cloud reaches you sooner rather than later.
We would also like to point out that for a lot of customers the secondary problem is communication. The notion among many companies in Europe is that it is hard to get good communication going with larger companies such as Google and Amazon. How do you feel communication worked during the days you were down with Amazon's service? Did you get proper updates about the volumes that you still might not have access to? Could you call somebody who could explain? While uptime is key, we think proper communication is just as important, especially in times like these with lots of downtime in many major services. That includes delivering the message in your language and in a way that is easily understood.
On the downtime note, Amazon is certainly not the only one having trouble this week. Sony has had a lot of hiccups and downtime in its online gaming service PSN lately. This week we will be counting PSN's downtime in days, not hours. This leaves millions of frustrated gamers offline. There is more work to be done. The cloud is moving rapidly forward, making the adjustments needed to allow for real redundancy and make 100% uptime a reality.
You can follow the status of Amazon's services on its Service Health Dashboard. As we post this on Saturday the 23rd, Amazon is still having problems, but lots of customers are now up and getting to a lot of their data, if not most.
If you are interested in reading more please visit some of these links: