How I learned to stop worrying and love uptime


Uptime, that pot of gold at the end of the rainbow. Everybody is after it, and we throw everything we can at our servers in the hope that they stay on all the time.

It's no easy task, but thanks to cloud computing the hardware side has gained benefits like automatic disaster recovery and failover, provided you did your homework and set up your datacenter right.

Now, what about the software?

All about uptime

For those of you who are not obsessed with this, allow me to explain. Uptime is the time your server has been running since its last boot. Who knew, right?

Uptime is just a number, but the implications are vast. Uptime leads to what we call SLAs, or Service Level Agreements, which use the percentage of time the servers have been up over a given period as a parameter of the agreement.
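To make this concrete, here is a quick back-of-the-envelope sketch in Python (the downtime figure is made up for illustration) of how an uptime percentage for an SLA could be calculated:

# Sketch: computing an uptime percentage for an SLA.
# The downtime figure below is made up for illustration.

period_minutes = 30 * 24 * 60     # a 30-day measurement period, in minutes
downtime_minutes = 43             # total minutes the service was unreachable

availability = 100.0 * (period_minutes - downtime_minutes) / period_minutes
print(f"Availability over the period: {availability:.3f}%")  # about 99.9%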

Every computer that has been started and is still on has an uptime. Let's see how to check it on Windows 7: load the Task Manager (usually with the Ctrl+Shift+Esc key combination) and go to the "Performance" tab.

[Screenshot: the "Up Time" field in the Windows Task Manager]

Here you'll see how long your computer has been running under "Up Time", displayed in days:hours:minutes:seconds format.

On Linux it's even easier: just run the following command in a console:

uptime
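If you would rather read the value programmatically, here is a minimal Python sketch that does the same thing on Linux by reading the kernel's /proc/uptime file (the first field is the number of seconds since boot):

# Sketch: read the system uptime on Linux from /proc/uptime.

with open("/proc/uptime") as f:
    seconds = int(float(f.read().split()[0]))

days, rest = divmod(seconds, 86400)
hours, rest = divmod(rest, 3600)
minutes, secs = divmod(rest, 60)

print(f"Up {days} days, {hours:02d}:{minutes:02d}:{secs:02d}")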

The opposite of uptime is downtime, and its importance becomes apparent when, for instance, you have just paid for a campaign, many more users are trying to reach your site, and all of a sudden it cannot be accessed. Or when a disk fails, since hardware is bound to fail sooner or later, and you just lost a couple of thousand dollars because users cannot get to the e-commerce part of the site.

Downtime can come from several sources. It could be hardware related: a router here and there that failed to transmit a packet, or a thunderstorm taking out an entire datacenter. Or it could be software related: a web server that is not properly set up and completely clogs the machine, or a kernel panic that blocks access to everything.

This is why we strive to have all kinds of tools that ensure the best uptime possible.

Monitoring

Monitoring is the process of regularly checking your servers to see if something is wrong. This can be done from the same network, from another network in the same datacenter, or even externally. It's usually better to have multiple monitoring services, but of course each extra component adds maintenance work.

What we usually check is load (CPU, disk, memory) and accessibility. Different types of outages demand different monitoring solutions. You can have something simple, like an external service that just tries to download a file from your site, or you can go all the way and run a full-fledged monitoring system with local agents (like Nagios, for instance).
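As an illustration of the simple end of that spectrum, here is a minimal Python sketch of an external check (the URL and timeout are placeholders) that just tries to fetch a page and reports whether the site answered:

# Sketch: a very basic external availability check.
# A real monitoring service would also record latency, retry and alert someone.

import urllib.request
import urllib.error

URL = "https://www.example.com/"   # replace with the page you want to probe
TIMEOUT = 10                       # seconds before we consider the site down

def site_is_up(url):
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        return False

print("UP" if site_is_up(URL) else "DOWN")

Run it from a machine outside your own network (or from a scheduled job) and you have the crudest possible external monitor.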

Monitoring is one thing, but just as important is the time you take to respond to issues. Some systems recover automatically and simply let you know; at other, less fortunate times you have to take care of the situation yourself. During working hours that's part of the job, but when you are asleep or on the road with shaky Internet access, you start to wish you had better redundancy in place.

Redundancy

Redundancy is an umbrella term that refers to a set of technologies and methodologies that give a system automatic failure detection and recovery. In other words, the system is designed so that in the event of a failure, backup systems take over and the service keeps running without any customer or visitor noticing.

These systems are usually automated, promoting another node or database as soon as the primary component fails. They are also smart enough to resume normal operations, since a failed node may come back at any time, and they report what the problem was and whether it was resolved.

Two concepts go hand in hand here: redundancy and scalability. The reason is that they rely on similar underlying technologies and both have to deal with elasticity, that is, being able to allocate resources when traffic increases or when a backup system is needed.

Some concrete examples include a set of web servers with a load balancer as the point of entry. The load balancer knows which web servers are available and directs traffic accordingly. What if one of the web servers fails? The load balancer notices on its next probe and removes that server from the rotation. The probing continues until the web server is back online, which can happen automatically or through manual intervention.
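The probing logic itself is conceptually simple. The following Python sketch (the backend addresses are hypothetical) shows the kind of health-check loop a load balancer runs, marking backends as in or out of the rotation; real load balancers of course do this internally with far more options:

# Sketch: a simplified health-check loop, in the spirit of what a load balancer does.

import socket
import time

BACKENDS = ["10.0.0.11", "10.0.0.12"]   # hypothetical web server addresses
PORT = 80
PROBE_INTERVAL = 5                      # seconds between probes

def tcp_probe(host, port, timeout=2.0):
    """Return True if the backend accepts a TCP connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

healthy = {backend: True for backend in BACKENDS}

while True:
    for backend in BACKENDS:
        up = tcp_probe(backend, PORT)
        if up != healthy[backend]:
            # A real system would alert and reconfigure; we just print the transition.
            print(f"{backend} is now {'back IN' if up else 'OUT of'} the rotation")
        healthy[backend] = up
    time.sleep(PROBE_INTERVAL)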

Another common example is a database. With replication you can keep that database synchronized to other servers, one of which becomes the master once the primary fails. There are plenty of headaches associated with this, but the point is that while it's almost impossible to achieve 100% uptime, the more time your service is available, the better.

Which leads us to the "nines". You have probably seen the term tossed around the web: it's basically a way to describe how reliable a system is in terms of uptime, by counting how many nines there are in its availability percentage (99.9%, 99.99% and so on). You can check the Wikipedia article for more information.
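To get a feel for what each extra nine buys you, a quick Python sketch translates an availability percentage into the downtime it allows per year:

# Sketch: how much downtime per year each availability level allows.

MINUTES_PER_YEAR = 365 * 24 * 60

for availability in (99.0, 99.9, 99.99, 99.999):
    allowed = MINUTES_PER_YEAR * (100 - availability) / 100
    print(f"{availability}% uptime allows about {allowed:.0f} minutes of downtime per year")

Three nines sounds impressive until you notice it still allows almost nine hours of downtime a year.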

Planning ahead

Are we expected to solve all of this the moment a server fails? It sounds complex! Well, that's because it is, which is why it's better to plan ahead.

Planning ahead means taking estimated traffic into consideration when designing the system. Cloud computing is great in that regard, because it lets you easily add (or remove) resources without having to call your hardware provider, only to realize later that you didn't need the extra hardware anymore, or that you needed even more of it.

To be fair, estimating traffic is a moving target, and a really elusive one at that, which leaves you with only an approximation of when and how the system is going to crash or need more resources to guarantee uptime. In our experience, you will run into this sooner or later. You can avoid some trouble by running simulations, which is easy to do with virtual servers. That way you can hit a problem before it happens for real, apply a solution and move on.
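A simulation does not have to be elaborate to be useful. Here is a minimal Python sketch (the URL, concurrency and request count are made up; point it only at a test or virtual server you own) that fires a burst of concurrent requests and counts how many fail:

# Sketch: a tiny load "simulation" against a test server.

import urllib.request
import urllib.error
from concurrent.futures import ThreadPoolExecutor

URL = "http://staging.example.com/"   # hypothetical test environment
REQUESTS = 200
CONCURRENCY = 20

def fetch(_):
    try:
        with urllib.request.urlopen(URL, timeout=5):
            return True
    except (urllib.error.URLError, OSError):
        return False

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(fetch, range(REQUESTS)))

print(f"{results.count(False)} of {REQUESTS} requests failed")

Crank up the numbers until something breaks, fix it, and you have found a problem before your visitors did.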

The subtext is that planning will save you unnecessary headaches.

Conclusions

We have barely scratched the surface of what can be done to achieve the best uptime possible. There are entire books on reliability, scalability, automation and other such subjects. If all of this feels overwhelming, that's because it is, so don't lose hope and start simple: a web server and a database server, then add a second web server and load balance the two, then replicate the database, and so on.

And with a hosting provider like City Network, you only have to worry about your site. In City Cloud, where the virtual dedicated servers are controlled by you, the situation is different, but even so you can experiment without incurring heavy costs.

In the end, uptime is key so that visitors never have to worry about a missing or unresponsive site. Users only notice when the site is down; the rest of the time they simply assume 100% uptime. Yes, it's a cold, hard world, but since everyone expects that, we should live up to those expectations as best we can.