Rethinking Puppet Deployment

This is the third post in a three-part series on Puppet deployment.

As the last blog post in this series demonstrated, scaling Puppet deployment is hard. As your deployment grows in size and complexity and you have to maintain more modules, the existing tools for managing all of this start to break down.


If the current tools don’t cut it, then what do you need? What characteristics should a good deployment tool have?

Speed

First off, a good deployment tool should be fast. Slow deployments kill productivity and butcher your ability to react when things start happening. If you can deploy something very quickly and something goes wrong, you can turn around and run another deployment to fix it. Reaction time matters, and it matters quite a bit.

In addition, if you’re using a deployment tool as part of your development workflow, then speed is absolutely critical. You want a very short feedback cycle between making a change and being able to test it, and if you’re constantly waiting for your code to deploy, your productivity is going to be trashed.

Durability

If you’ve spent any time in Operations, then you’ll be well aware that any service you rely on will break at the least opportune time possible. For instance, GitHub and the Puppet Forge have both had outages that could cause deployments to fail. As a matter of fact, your servers are probably plotting right now to all fail at once, exactly when you need them most.

Since you can expect things to fail, a deployment tool needs to be durable. If a deployment fails partway through, the worst thing it could do is crash and leave things half deployed. If you’re deploying an environment and one module can’t be deployed, failing immediately means you could be left with a pile of missing or stale modules and a completely broken environment.

Instead of killing a deployment on the first error, a much better approach is to skip over whatever failed and deploy as much as we can. That way we get as close to the desired state as possible, and while errors may still occur, their impact is minimized. Since this hypothetical deployment tool is already supposed to be fast, it should be easy to fix the failure by retrying the deployment of whatever failed. If the failure was transitory, then things were only degraded for a brief window, and the degradation was kept to a minimum.


Say hello to R10K.

R10K is a hybrid solution for deploying Puppet code: it handles both Git-branch-based environments and Puppetfile-managed modules. For environments, it implements the original Git workflow of creating one Puppet environment per branch. It takes a configuration like this:

cachedir: '/var/cache/r10k'

sources:
  mymodules:
    remote: 'git@github.com:my-org/org-shared-modules'
    basedir: '/etc/puppet/environments'

And it will create a Puppet environment in /etc/puppet/environments for each Git branch in that repository.
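For instance, assuming the repository above has branches named production, staging, and webserver_refactor (hypothetical branch names, just for illustration), you would end up with one environment directory per branch:

/etc/puppet/environments/production          <- 'production' branch
/etc/puppet/environments/staging             <- 'staging' branch
/etc/puppet/environments/webserver_refactor  <- 'webserver_refactor' branch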

R10K also provides an implementation of the Puppetfile format. If you’ve been using librarian-puppet inside your Puppet environments, R10K can provide your modules for you. Since R10K is built for doing in-place deployment of Puppet code, it tries really hard to make sure that the live manifests are always available.
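As a sketch, a Puppetfile at the root of an environment might look something like this (the module names, version, and ref are illustrative, not a recommendation):

# Puppetfile
mod 'puppetlabs/stdlib', '4.1.0'

mod 'apache',
  :git => 'https://github.com/puppetlabs/puppetlabs-apache',
  :ref => 'master'

R10K reads this file and installs each listed module into the environment's module directory, whether it comes from the Forge or straight from a Git repository.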

Making R10K Fast

Git plays a very important part in the Puppet ecosystem. For example, 80% of the modules on the Puppet Forge host their source code on GitHub, and GitHub itself hosts many more Puppet modules. When running deployments with R10K, the vast majority of the deploy time is spent performing operations on Git repositories. As mentioned earlier, R10K is built with performance in mind, so speeding up Git is a natural place to start.

Why is Git Slow?

The typical Git repository carts along the working directory that you normally see, as well as the entire history of the project. For almost all day-to-day operations with Git this is fine, since you generally only have one or two copies of a given repository. But if you have lots and lots of clones of the same repositories, it’s fantastically wasteful.

In the best case, a Git repository uses about the same amount of space for its objects as for its working directory. In the worst case, a repository with a long history and a lot of big files can use 10 to 100 times more space than the working copy alone. If you’re creating, updating, and destroying multiple copies of the same repository, you’ll be carting that history around for every copy, which is a waste of bandwidth and disk space.
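If you want a rough sense of this in your own repositories, you can compare the history against the working directory directly (GNU du shown here; the exact numbers will obviously vary by repository):

# size of the tracked files you actually use
du -sh --exclude=.git .

# size of the history and Git metadata you're carting around with it
du -sh .git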

Making Git Faster

When R10K clones a repository for the first time, it immediately caches it for future use. Working copies of that repository don’t have to contain any objects themselves; instead, each working repository uses Git’s alternate object directories to reference the cached repository, skipping all the work of creating a full clone.

After that initial cache is created, R10K never needs to do a full fetch of the repository again. Even when it creates a brand new clone, all of the history is already cached, so the new clone only has to check out the working directory.
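The effect is roughly what you would get by wiring up Git's object sharing by hand. Here is a minimal sketch: the cache path matches the cachedir from the configuration above, but the layout underneath it and the branch name are assumptions for illustration, not exactly what R10K does internally.

# mirror the repository once into the cache
git clone --mirror git@github.com:my-org/org-shared-modules \
  /var/cache/r10k/org-shared-modules.git

# new working copies borrow objects from the cache instead of copying them
git clone --reference /var/cache/r10k/org-shared-modules.git \
  --branch production \
  git@github.com:my-org/org-shared-modules \
  /etc/puppet/environments/production

# the link is just an alternates file pointing at the cached object store
cat /etc/puppet/environments/production/.git/objects/info/alternates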

Lies, Damned Lies, and Benchmarks

How about some numbers?

I pulled these numbers out of our production environment. While there are a few things that could probably be pruned down, this provides a general idea of how much space is being saved.

If we weren’t using R10K, keeping 37 full clones of a repository with roughly 30MB of working copy and 1.1GB of history each would use about (30MB + 1.1GB) * 37 ≈ 41GB of space. With R10K, the total space consumption is around 2GB. That’s only a 95% reduction in disk space, so no big deal.

Using R10K Effectively

A lot of current deployment tools take the approach of “update everything, every single time.” This has the obvious benefit of keeping everything uniformly updated: you don’t have to know exactly what changed, because if you blast out an update to everything, eventually every recent change gets deployed.

When you’re developing a module or environment, you want a very quick feedback cycle so that you spend more time working and less time waiting for a deployment to run. If you have a fairly large deployment with a lot of environments and modules, updating everything is the best way to slow your development cycle to a crawl. Even with all the work R10K does to be fast, if you run a full deployment every single time, you’ll spend a lot of time updating environments and modules that haven’t changed and that you don’t care about.

When you’re working on an environment or module, chances are you want to deploy that work, and that work alone. You can’t go much faster than the time it takes to run git fetch, so getting as close to that speed as possible is a reasonable goal.

R10K is optimized for exactly this situation by letting you run very fine-grained deployments. If you want to update a single environment, run r10k deploy environment <environment name>. Likewise, if you want to update a single module, run r10k deploy module <module name>. This way you deploy exactly what you need and nothing more, and R10K can finish within a few seconds.

If you want to make sure that everything is up to date, run r10k deploy environment --puppetfile in the background and let R10K go crazy. But when you need fine-grained control and quick results, R10K gives you all the tools you need.
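Put together, a typical working session might look something like this (the environment and module names are placeholders):

# deploy just the environment you're working on
r10k deploy environment my_feature_branch

# deploy a single module
r10k deploy module apache

# periodically, sync everything (including Puppetfile modules) in the background
r10k deploy environment --puppetfile &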

Handling Failure

As mentioned earlier, handling failure has to be an integral part of deploying Puppet code.

If some part of your infrastructure has exploded and you’re trying to fix it with Puppet, the last thing you need is an entire deployment failing because GitHub or the Forge hiccuped on some trivial module. librarian-puppet was pretty bad about this: when updating modules it would blow away all of the installed modules (already bad), and if it failed partway through it would immediately crash, leaving you with a partially deployed environment.

When R10K runs, it tries to handle errors as gracefully as possible. If a critical error occurs while deploying an environment or module, R10K leaves that module in its last good state and moves on. Errors have the smallest impact possible, so even when everything else is failing, you can keep running Puppet deployments reliably.


Addendum

So why is it called R10K?

It’s called R10K because I’m terrible at names. When I started working on R10K I couldn’t come up with a decent name. While trying to come up with something clever, I recalled that Randall Munroe wrote a bot for controlling IRC chatter, and gave it the most generic name he could think of - Robot 9000. Since I just needed a name, any name, I decided to go with an equally generic name by incrementing the robot index.

And now you know, the rest of the story.

WHY DID IT TAKE YOU THREE MONTHS TO PUBLISH THIS BLOG POST?

So… yeah. About that. I decided that I wanted to get R10K in a more polished state before I wrote the final blog post. I figured a few bugfixes here, a few feature improvements there, no problem right?

I got carried away and reimplemented a good chunk of R10K. It took a lot longer than I expected and was a pretty serious amount of work, but the end result was worth it. R10K is much more functional, has far fewer bugs, and should be much more stable for future development. Although seriously, screw Hofstadter’s Law.


Major thanks go to Cory Kolbeck and Joe Rawson. They have been tirelessly reviewing and editing my blog posts, and without these lovely gentlemen, none of my posts would even resemble English.