While in beta we have had great hopes of sticking very close to 100% uptime in City Cloud. However, yesterday (Sunday) we had our longest outage, where some customers were out of service for as long as 10-11 hours. That is completely unacceptable, of course, particularly in a service where we want to set a new standard for uptime. With this blog post we want to spend a few minutes giving more detail on what happened.
To give some background, our cloud environment is built on multiple blade servers at the front, which connect to a common storage pool via redundant switches and servers for performance. We have made multiple changes to how we connect to our storage since we started. This was to find the best balance, so that we can grow in a sound way and get the best performance.
Our problems started early Sunday morning, just after 06:30 CET. An indication had come in earlier, when a customer reported that their server was “slow” given the load they were seeing. We investigated and found that one blade was very heavily loaded, so we moved a few VMs to a different blade server with less load.
In a similar fashion, data started moving slowly at 06:30 Sunday morning, and by 07:30 a majority of the servers had come to a halt, or were responding so slowly that they might as well have been down. It was clear that communication was clogged somewhere. A series of events had taken place, and at the heart of it two faulty controllers were the issue. However, time was spent on several other areas before the controllers were identified.
With the issue identified, taking down the hundreds of servers took significant time because of the problem with how data flowed through to our storage systems. It was important to let them shut down properly to ensure data integrity, although this added significantly to the downtime.
All servers were up again at approximately 17:30 Sunday afternoon. However, knowing that some disks were reporting problems, we also chose to take the system down at 05:00 this morning for about one hour to make adjustments to a number of disks, as a precaution.
We are monitoring the various controllers and disks that caused the trouble. We are also discussing how to keep this from happening again, given that it occurred despite existing redundancy.
While financial compensation does not recover lost uptime, all paying customers will receive compensation as per our guarantees. You will see this on your next invoice, where a 50% deduction will be made for this month.
We hope this gave some insight, albeit at a high level.
Thank you for being a City Cloud customer, and thank you for your patience with this issue.