At one of the places I worked, uptime was a pretty big deal. If a machine had fewer than 100 days of uptime, clearly something was wrong. If you're running Windows you have to reboot for monthly patches, and Windows desktops were rebooted weekly or daily. By contrast, we had boxes whose uptime stretched to 200, 300, 500, or more days. We were convinced that this meant we were providing excellent service.
The reality is that uptime doesn't mean a thing. What matters is availability. It's easy to confuse the two - if the machine's up, it's available, right? But if you have a machine that hasn't rebooted in a year, here's what I can tell you about it:
- Unless you’re using ksplice, you haven’t updated your kernel in a year.
- You haven't run an fsck on your root partition in a year.
- You probably have no failover options.
- When this machine goes down, you’re going to have to go into crisis mode.
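The list above suggests a quick audit you can script. Here's a minimal sketch, assuming a Linux box with /proc/uptime; the 200-day threshold is an arbitrary choice for illustration, not an established standard:

```shell
#!/bin/sh
# Flag boxes whose uptime suggests a stale kernel and an overdue fsck.
# Assumes Linux: /proc/uptime holds "<seconds up> <idle seconds>".
THRESHOLD_DAYS=200

uptime_secs=$(cut -d. -f1 /proc/uptime)   # integer part of seconds up
uptime_days=$((uptime_secs / 86400))

if [ "$uptime_days" -ge "$THRESHOLD_DAYS" ]; then
    echo "WARN: up ${uptime_days} days - kernel and fsck are likely stale"
else
    echo "OK: up ${uptime_days} days"
fi
```

Dropped into cron or a config-management run across your fleet, something like this turns "huh, that box has been up a while" into a report you can act on before the crisis.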
Uptime isn't a badge of pride, and if you focus only on uptime you are ignoring the cost of the inevitable downtime. If you're trying to maintain uptime because you don't have a way to maintain service, you have two options - bite the bullet, reboot, and take the availability hit, or forgo rebooting and take on the risks of unpatched vulnerabilities and a possibly prolonged outage due to unaddressed issues. The opposite of downtime isn't uptime, it's availability.
A point to ponder on this topic is Netflix's Chaos Monkey. If you haven't heard of it, Chaos Monkey is basically a service that randomly kills instances at Netflix. By enforcing this element of unpredictability, you can't get away with having single points of failure. If your machines are randomly dying, then uptime isn't so much a metric of your badass sysadmin skills as it is luck that a machine hasn't been blown away yet. Under this constant threat, you have to engineer systems durable enough to survive it.
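The core of the idea fits in a few lines. This is a toy sketch, not Netflix's actual implementation - the instance list is made up, and it only echoes the victim instead of terminating anything:

```shell
#!/bin/sh
# Toy chaos-monkey: pick one "instance" at random from a list.
# In a real setup the echo would be a call into your infrastructure
# (hypothetical - wire it to your own termination hook).
instances="web-1 web-2 web-3 db-1"

count=$(echo "$instances" | wc -w)
pick=$(( $(awk 'BEGIN { srand(); print int(rand() * 1000000) }') % count + 1 ))
victim=$(echo "$instances" | cut -d' ' -f"$pick")

echo "terminating $victim"
```

The point isn't the script - it's that once anything can die at any moment, every single point of failure gets found for you, on a schedule you don't control.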
It might initially seem that Chaos Monkey is an artificial system that's not realistic for normal shops, but really it's just hastening the inevitable. Hardware fails, power cables get ripped out, and people make errors and blow away machines. It happens. Even if your hardware is flawless, sysadmins are still human and mistakes happen; even with a team of precise rockstar sysadmins, all it takes is one server letting out the magic smoke. If you believe you don't need to worry about hardware or human error, then either you're deluded, not looking hard enough, or you've simply gotten lucky so far.
The problem is that high availability is hard, and it doesn't directly produce value - to the outsider, it probably looks like overhead. Backups look like overhead in the same way. These things only show their value when shit goes horribly wrong and you need something to fall back on. How do you make the case for them? If someone wants storage, they don't ask for storage plus nightly backups; they ask for storage, and only after something goes wrong do they want the backups. Likewise, people may be unenthused to ask for a simple webserver and find that the cost or engineering time is double what they expected. But when your cloud provider suddenly goes up in flames, if you haven't made the effort to do that high availability work, they'll be doubly unhappy.
Any time you see a box with over 200 days of uptime, ask yourself: "is this box still running because it's particularly stable, or is it up because we have no alternative?" If it's the latter, ponder how that could be rectified. If you only play the short game, you'll spend all your time maintaining systems and never get the opportunity to improve them.