During the past week, about 10% of our City Cloud customers experienced some form of outage. In this post mortem we want to give you full transparency about what happened and how we will continue working towards a permanent solution.
City Cloud is built on a large number of blade servers that run the virtual servers, backed by several large storage systems. The system we have had issues with this past week (and also in June) is a Gluster storage system.
Earlier this summer we identified a number of bugs in Gluster related to the way it handles large files under certain conditions. These bugs could affect the performance and even the stability of a VM. Most of them were fixed during an emergency update on the 21st of June.
However, last week we ran into further issues, where a combination of problems led to the service interruptions you as a customer may have been impacted by.
The Gluster system is made up of several nodes, which are basically servers with lots of disk. Those nodes work in pairs to ensure that all data is located on two nodes, so downtime (or even a complete system failure) of a single server never affects data availability. When a node has been down for some reason, a process called self heal ensures that the data on the two nodes gets back in sync.
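The pairing logic above can be sketched as a small model. This is a simplified illustration of the replica-pair idea, not Gluster's actual implementation; the node names and method names are invented for this sketch.

```python
# Illustrative model of a two-node replica pair (NOT Gluster's real code):
# data stays available while at least one node is up and in sync, and a
# node that returns after downtime must be "self healed" before it can
# serve reads again.

class ReplicaPair:
    def __init__(self):
        self.up = {"node-a": True, "node-b": True}
        self.in_sync = {"node-a": True, "node-b": True}

    def fail(self, node):
        self.up[node] = False
        # Writes now land only on the surviving node, so the failed
        # node's copy is stale until it is healed.
        self.in_sync[node] = False

    def recover(self, node):
        self.up[node] = True  # back up, but stale until self heal runs

    def self_heal(self, node):
        # Copy pending changes over from the healthy replica. In the
        # real system this resync phase is where VM performance suffered.
        if self.up[node]:
            self.in_sync[node] = True

    def data_available(self):
        # Reads need at least one node that is both up and in sync.
        return any(self.up[n] and self.in_sync[n] for n in self.up)


pair = ReplicaPair()
pair.fail("node-a")
print(pair.data_available())   # True: node-b still serves all data
pair.fail("node-b")
print(pair.data_available())   # False: both nodes of the pair are down
pair.recover("node-b")
pair.self_heal("node-b")
print(pair.data_available())   # True again after recovery and self heal
```

The key property is the last one: a single node going down is harmless, but losing both nodes of the same pair, even briefly, makes that slice of data unavailable, which is exactly what happened on Thursday.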
The problem started last week when a single Gluster node went down and required a reboot. This wasn't a problem in itself, as the other node kept serving the data without interruption, exactly as expected. When the downed node was brought back up, self heal started in order to get it in sync with the other node. During self heal, some virtual machines suffered decreased performance and, in some cases, even downtime. When self heal finished, all VMs returned to their normal state. This has been identified as an issue in how Gluster performs self heal.
Over the next few days this happened two more times, on different nodes. Each time, self heal had to run again, affecting performance for the VMs located on the affected pair. On Thursday, when a node went down, we left it down for a few hours to allow us and Gluster to research the root cause of the failure. During those hours the second node of that Gluster pair also went down, leaving that part of the data unavailable. A limited number of virtual machines lost access to their storage and consequently went offline. The node was brought back online within minutes; however, not all VMs started automatically and were left turned off, having to be started manually. Some VMs even required manual assistance from us before they could be started at all.
At this point we have moved some affected virtual machines to other (non-Gluster) storage systems, and we are awaiting an update from Gluster that addresses both the issues behind last week's interruptions and the root cause of the large-file handling problems.
We will shortly give you dates for maintenance windows, which will be used to update the Gluster software as well as the underlying operating systems. We are in daily contact with Gluster and have full confidence that they will address all issues.
Expect more information via e-mail and here on the blog as well. We thank you for your patience if you have been affected by last week's storage issues.
Johan – CEO City Network