We run several types of centralized storage for City Cloud. Some have been more successful than others over the years. This year we have made large investments in a very attractive solution called Gluster (www.gluster.com). The fundamentals of this storage solution are very solid. However, one concern with all new solutions that create highly redundant, scalable storage pools is that they also become more complex. This complexity has caused a number of issues over the past few weeks.
On Thursday the 20th we had scheduled an update of one of our several storage systems. The one in question was the Gluster system, which had failed previously; during that failure Gluster had identified bugs that caused disturbances of various kinds. We had also identified (thanks to the Gluster folks) that we were running drivers for some of our network interfaces that contained a known bug. The decision was made to update both Gluster and the affected network interface drivers.
The upgrade started and finished as planned. At 4 AM all servers were back online (about 1000 virtual machines). Just a couple of hours later we started noticing issues: within a few minutes, two nodes in two different replicating pairs experienced network failures. This was still not a problem thanks to Gluster's redundancy, but it was clearly a sign of something not being right. While we were in discussions with Gluster to identify the cause, one more node experienced a network failure, this time in one of the pairs that already had a node offline. This caused all data located on that pair to become unavailable. The suspicion at this time was that the problem was related to link aggregation, so that was disabled on all nodes. The nodes with network issues were brought back online, and for a period of time self-heal operations were in progress.
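For readers curious about what a self-heal pass involves: on older Gluster releases such as the one we were running, healing of a replica that has been offline is triggered simply by looking up each file through the client mount, so a crawl of the volume is enough to kick it off. Below is a minimal sketch of such a crawl in Python, assuming a hypothetical mount point /mnt/glustervol; it is illustrative only, not our exact procedure.

    import os

    # Hypothetical client-side mount point of the replicated Gluster volume.
    MOUNT = "/mnt/glustervol"

    # Walking the volume and stat()ing every entry makes the Gluster client
    # look up each file, which in turn triggers self-heal of any stale copies
    # on the replica that was offline.
    for root, dirs, files in os.walk(MOUNT):
        for name in dirs + files:
            path = os.path.join(root, name)
            try:
                os.lstat(path)
            except OSError:
                # A failed lookup is noted and skipped; it usually means the
                # file will need manual attention later.
                print("could not stat %s" % path)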
Disabling aggregation probably improved the situation, but it did not solve the problem. Around 4 PM the network outages reappeared, this time on only one node, but confirming that the problem had not yet been completely resolved. At this time most VMs were up, except for a few with lingering issues caused by the outages. At around 8 PM a potential software bug in the specific driver / hardware combination was identified, and an update was applied to the node that was currently down. Due to the severe risk of more nodes losing network connectivity, which would have resulted in massive outages, an emergency update was planned for 10 PM.
The upgrade went well, and self-heal for the nodes that had been down was initiated, so performance was somewhat impacted during the night.
At this point about 100 of the 1000 affected VMs had some form of issue that needed manual attention to start. Because the servers lost storage abruptly, there were certain types of Gluster issues where files did not match each other on the two nodes in a storage pair. There were also some cases of data corruption in the VMs' filesystems due to VMs going down in an uncontrolled way.
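To give a sense of what "files not matching" means in practice: each storage pair keeps a full copy of every file on both nodes, so a mismatch can be spotted by comparing checksums of the same path on the two sides. The sketch below illustrates the idea, assuming the two brick directories are reachable at hypothetical paths /bricks/node-a and /bricks/node-b; the actual repair work was done with Gluster's own self-heal tooling and, where needed, by hand.

    import hashlib
    import os

    # Hypothetical locations of the two copies kept by a replicating pair.
    BRICK_A = "/bricks/node-a"
    BRICK_B = "/bricks/node-b"

    def file_md5(path):
        """Return the MD5 hex digest of a file, read in chunks."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    # Compare every file present on node A with its counterpart on node B and
    # report paths whose contents differ or that are missing on one side.
    for root, dirs, files in os.walk(BRICK_A):
        for name in files:
            path_a = os.path.join(root, name)
            path_b = os.path.join(BRICK_B, os.path.relpath(path_a, BRICK_A))
            if not os.path.exists(path_b):
                print("missing on node B: %s" % path_a)
            elif file_md5(path_a) != file_md5(path_b):
                print("contents differ: %s" % path_a)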
Together with technicians from Gluster and Enomaly, we had three teams working around the clock to make sure each and every VM got the attention it needed. Depending on the individual problem, some VMs came up quickly and some took a lot longer; it was not until Sunday night that nearly all VMs were up and running again. We are still aware of some customers needing more assistance with individual cases, and that is what we are currently working on.
The coming days will mostly be spent on the individual assistance mentioned above, as well as on closely monitoring the Gluster system to ensure that no lingering problems remain.
We are devastated over the downtime this issue has caused some of our customers. We realize the importance of uptime and will discuss further with Gluster to make sure we get firm guarantees of the high availability they promise. If not, we have contingency plans to migrate away from Gluster altogether. Decisions will be made in discussions with Gluster and our technical staff.