Limited outage – post mortem

2011-10-12 Comments: 5

During the last week, about 10% of our City Cloud customers experienced some form of outage. In this post mortem we want to give you full transparency into what happened and how we will continue to work towards a permanent solution.

City Cloud is designed with a large number of blade servers running the virtual servers and large storage systems behind them. Several storage systems are involved, but the one we have had issues with as of last week (and also in June) is a Gluster storage system.

As early as this summer we identified a number of bugs in Gluster related to the way Gluster handles large files under certain conditions. This could affect the performance and even the stability of a VM. Most of these bugs were fixed during an emergency update on the 21st of June.

However, last week we ran into further issues, where a combination of problems led to the service interruptions that you as a customer may have been impacted by.

The Gluster system is composed of several nodes, which are basically servers with a lot of disk. Those nodes work in pairs to ensure that all data is located on two nodes, so downtime (or even a complete system failure) of a single server never affects data availability. When a node has been down for some reason, a process called self-heal ensures that the data between the nodes gets back in sync.
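For readers curious about the mechanics, here is a minimal sketch of how a two-way replicated Gluster volume of this kind is typically set up, and how a self-heal could be triggered in the Gluster versions of this era. The volume name, host names and paths are hypothetical examples, not our actual configuration:

    # Create and start a volume where every file is replicated across a pair of nodes
    # (volume name "vmstore", hosts "node1"/"node2" and brick paths are made up for illustration)
    gluster volume create vmstore replica 2 transport tcp \
        node1:/export/brick1 node2:/export/brick1
    gluster volume start vmstore

    # In Gluster releases of this period, self-heal was triggered by touching files
    # from a client mount, e.g. by stat:ing every file on the mounted volume
    find /mnt/vmstore -noleaf -print0 | xargs --null stat > /dev/null

The self-heal reads live data from the surviving node in order to re-replicate it, which is why VMs on the affected pair can see degraded performance while it is in progress.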

The problems started last week when a single Gluster node went down and required a reboot. This in itself was not a problem, as the other node kept serving the data without interruption, as expected. When the downed node was brought up again, self-heal was initiated to get it back in sync with the other node. During the self-heal, some virtual machines were affected by decreased performance and, in some cases, even downtime. When the self-heal finished, all VMs went back to their normal state. This has been identified as an issue in how Gluster performs self-heal.

Over the following days this happened two more times, on different nodes. Each time, self-heal had to be run again, affecting performance for the VMs located on the affected pair. On Thursday, when a node went down, we left it down for a few hours to allow us and Gluster to research the root cause of why it went down. During those hours the second node of that Gluster pair also went down, leaving that part of the data unavailable. A limited number of virtual machines lost access to their storage and obviously went offline. The node was brought back online within minutes; however, not all VMs started automatically and were left turned off, having to be started manually. Some VMs also required manual assistance from us to be startable again.

At this point we have moved some of the affected virtual machines to other (non-Gluster) storage systems, and we are awaiting an update from Gluster that addresses the issues behind last week's interruptions as well as the root cause of the problems with handling larger files.

We will shortly give you dates for maintenance windows, which will be used to update the Gluster software as well as the underlying operating systems. We are in daily contact with Gluster and have full confidence that they will address all the issues.

Expect more information via e-mail and here on the blog as well. We thank you for your patience should you have been affected by last week's storage issues.

Johan – CEO City Network

Comments and interactions

  • Niklas

Things like this happen, but I did not like discovering that my VM had been shut down yesterday with no idea of what had happened. Next time (let’s hope there is no next time) I suggest you inform the affected customers when there is a service outage / VM shutdown, so that we can take care of _our_ customers wondering why they can’t access their website etc.

    • https://www.citynetwork.se/ Johan

We do inform customers while we have an issue. However, that information is way too hidden – like this page for instance: http://www.citynetwork.se/driftstatus/ – we need a dedicated status page. Do follow us on our Twitter account as per the reply above – we try to post updates every 30-60 minutes during an unplanned outage.

      Thank you for your patience and do contact our support should you have any other issues.

  • Jonas Ek

    Let customer information be a key performance and quality indicator. I would rather see too much information during incidents than too little!

    • https://www.citynetwork.se/ Johan

No doubt. I could not agree more. Information is of course key during, and also after, any type of incident. We can get better. We try to be very active on our @citycloud Twitter account, as it is the quickest way to reach everyone concerned.

  • http://twitter.com/bitconstructor BitConstructor

Well, the outage has repeated itself again. It is difficult to explain to our customers why all their servers are going down, not once but now multiple times. For other customers of ours, the performance in the last two weeks (since the outage) has been less than satisfactory.