Puppet, the Catalog, and You

If you’ve been using Puppet for any meaningful amount of time, you’ve heard the term ‘catalog’ thrown around. The official glossary defines the catalog as “a compilation of all the resources that will be applied to a given system and the relationships between those resources.” Well sure, that’s great, apparently the catalog holds a set of resources for a host. What does it look like? Is it a simple array, or perhaps a hash? How is the data stored? Is it a dump truck? Is it a series of tubes?

The catalog is a very important part of Puppet, but it’s a big topic with a number of touch points into the rest of the system. This post is a high level view of the catalog: what it is and how it works.

From manifests to a catalog

When dealing with Puppet, you’ll find people throw around terms like catalogs and graphs and assume that you follow along, because they’ve been using Puppet long enough that these terms are completely familiar to them. (Alternatively, they double majored in Computer Science and Mathematics, with a minor in pain.) But coming from the outside with a non-CS background, these terms can be as clear as mud.

So, from the beginning, when writing a manifest, you’ll end up with something like this:

class boardie {

  require ruby::dev
  require build
  require build::sqlite

  sinatra::app { 'boardie':
    giturl   => 'https://gitbox.local:boardie',
    revision => 'e1f262dd5e05df8bb02bec461c54835db0da1e94',
    rackup   => "puppet:///modules/boardie/config.ru",
  }

  $config = "/opt/boardie/config/config.yml"

  file { $config:
    ensure  => present,
    content => template('boardie/config.yml.erb'),
    owner   => 'www-data',
    group   => 'www-data',
    require => Sinatra::App['boardie'],
  }

  bundler::install { $root:
    require => [
      Sinatra::App['boardie'],
      File[$config],
    ],
  }
}

Like the majority of programs out there, for this manifest to mean anything, it needs to be transformed into some sort of representation that Puppet can actually operate on. We’ve got a number of classes being required, and there are resource dependencies between Sinatra::App[‘boardie’], File[$config], and Bundler::Install[$root]. We need to convert the above manifest into something that looks like this:

Puppet catalog example

(Note: I’ve simplified this a bit to keep things sane.)

This would be the catalog that represents the above manifest. The catalog serves to hold a set of resources, and it stores the relationships between them. To do this, the catalog stores all of this information in a data structure called a graph. So in effect, the catalog is a graph.
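As a sketch of the idea - this is NOT Puppet’s internal representation, just the concept - the boardie example above could be modeled as a directed graph where each resource is a node and each edge points at a resource it depends on:

```python
# A minimal sketch of a catalog as a directed graph. The keys are
# resource references; the values are the resources they depend on.
catalog = {
    "Sinatra::App[boardie]": [],
    "File[/opt/boardie/config/config.yml]": ["Sinatra::App[boardie]"],
    "Bundler::Install[$root]": [
        "Sinatra::App[boardie]",
        "File[/opt/boardie/config/config.yml]",
    ],
}

# Everything that File[$config] depends on must be applied before it.
print(catalog["File[/opt/boardie/config/config.yml]"])
```

The important part isn’t the dictionary itself, it’s that relationships are first-class data that can be walked, queried, and checked for cycles.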

Well wait, what does this mean?

So why bother with graphs?

So we know that we take in manifests and turn them into catalogs, but why is this effort worth it? I mean, come on, we’ve got servers to run, so could we please ditch the graph theory and get some systems running? The answer is that, at its heart, any sort of system configuration looks like a graph.

We have individual resources - files, services, users, etc. These can interrelate in fairly complex ways, and they don’t fall into a sequential order.

Some things have a lot of dependencies. Say you’re running a Rails app: if you’re doing it right, you’ll need Apache, a database, a firewall, users, logging, and so forth. You need to bring up Apache before Passenger, the user that Rails will run as needs to exist before the app is fired up, and the database needs to exist before you try to run a database migration. You have an interlaced set of dependencies, but there isn’t an exact step by step ordering for them, nor does there need to be.

Other tasks have no dependencies. If you’re installing vim, then that’s probably all that needs to be done. Something like this will be quite sufficient:

package { 'vim-enhanced':
  ensure => present,
}

So we have these different resources, and there are directed relationships between them. That’s basically a graph, which Wikipedia defines as follows:

A graph is an abstract representation of a set of objects where some pairs of the objects are connected by links.
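The Rails example above fits this definition neatly. Here’s a rough sketch (the resource names are made up for illustration) that walks those dependencies with a topological sort - notice that the constraints pin down only what must come before what, not one single total order:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical dependency edges from the Rails example: each resource
# maps to the set of resources that must be applied before it.
deps = {
    "passenger": {"apache"},
    "rails-app": {"rails-user", "passenger"},
    "db-migration": {"database", "rails-app"},
}

# One valid evaluation order; several others would also satisfy
# the same constraints.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

`apache` is guaranteed to land before `passenger`, and `database` before `db-migration`, but nothing forces, say, `database` and `apache` into a particular order relative to each other.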

Catalogs all the way down

As the post on pluginsync mentioned, catalogs can be used for more than configuring your system - they can be used to configure Puppet itself. This is a good example of the Unix philosophy of full exploitation - if something works, keep using it!

Added resiliency through dependencies

Alright, so apparently systems are graphs, and that’s how Puppet represents a system. What does this buy us?

Well, if you’re explicitly saying how things are related, via require, subscribe, and friends, you get more resilient code. Let’s say you have a node definition like this:

node webserver {
  include apache
  include mysql
  include sinatra
}

Suppose that Apache fails to install correctly - perhaps you had a typo in an Apache config file, so the service refuses to start. Should MySQL fail to install as well? Should the installation of the Sinatra gems? They’re not tightly coupled; they can be installed on their own, so why should one failure propagate to unrelated resources?

Managing failed resources

Now, these are all generally pretty quick to install, so perhaps we should broaden the scope a bit.

node downloadserver {
  include apache
  downloads::sync { '/srv/bigfiles': }
}

Say that you provision a new machine, sign its certs, fire off puppet, and then disconnect - after all, syncing those big files is going to take $LONGTIME. You come back a few hours later, and oshi - apache typo.

With Puppet, you don’t need to worry about this. If something fails, Puppet skips the resources that depend on it, works around the failure, and configures the rest of the system as best it can. This is actually pretty snazzy, and it means that Puppet can continue to enforce state in the face of partial degradation.
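One way to picture that behavior - a simplification of what Puppet actually does, with made-up resource names - is a walk over the graph that skips anything downstream of a failure:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical catalog: resource -> resources it depends on.
deps = {
    "Service[apache]": {"Package[apache]"},
    "Downloads::Sync[/srv/bigfiles]": set(),  # unrelated to apache
}
failed = {"Package[apache]"}  # pretend this resource failed to apply

applied, skipped = [], []
for resource in TopologicalSorter(deps).static_order():
    blocked = resource in failed or any(
        d in failed or d in skipped for d in deps.get(resource, ())
    )
    # A resource is skipped if it failed, or if anything it depends on
    # failed or was skipped; everything else is still applied.
    (skipped if blocked else applied).append(resource)
```

The big file sync still happens even though the Apache service never came up, because nothing ties the two together.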

Ordering the evaluation

As noted on the Puppet Labs blog a while back, Puppet 2.7 is actually deterministic. That is, if you apply one catalog to a host multiple times, the resources will be applied in the same order. Wait, didn’t we just say that order doesn’t matter? WHAT MADNESS IS THIS?

Puppet is single threaded, so it’ll only evaluate one resource at a time. (If it tried evaluating multiple resources at once, it would probably make my head explode, because concurrency is REALLY hard.) Since we can only evaluate one thing at a time, we have to flatten the graph into a list, and with that in mind, we might as well be consistent about how the graph is applied. Puppet achieves this by taking the SHA1 hash of each resource’s title and sorting those hashes alphabetically.
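That tie-breaking scheme can be sketched in a few lines - this is just the idea, not Puppet’s code, and the titles are made up:

```python
import hashlib

# Resources with no ordering constraints between them are sorted by the
# SHA1 hash of their title, so the order is arbitrary but repeatable.
titles = ["Package[vim]", "Package[emacs]", "Package[nano]"]

def title_key(title):
    return hashlib.sha1(title.encode()).hexdigest()

ordered = sorted(titles, key=title_key)
print(ordered)
```

The resulting order looks random, but every run of the same catalog produces the same sequence.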

What this means is that we get the best of both worlds. We get consistent ordering, but we get the full power of dependencies and the resiliency that provides.

Examining graphs

Puppet can convert catalogs into dot files, a plain-text format for storing graphs. You can use tools like OmniGraffle or Graphviz to render the graph files (produced with puppet agent --graph and stored in $vardir/state/graphs) into an image that’s pretty easy to read. You can also use Gephi to manipulate and explore graphs, view how things interrelate, and do some pretty neat transforms on the data.
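To give a feel for the dot format itself, here’s a toy generator - every dependency becomes an "A -> B" edge line inside a digraph block (the edge shown is from the boardie example earlier):

```python
# Toy illustration of the dot graph format, similar in spirit to what
# `puppet agent --graph` writes out (heavily simplified).
edges = [
    ("File[/opt/boardie/config/config.yml]", "Sinatra::App[boardie]"),
]

lines = ["digraph catalog {"]
for source, target in edges:
    lines.append('  "%s" -> "%s";' % (source, target))
lines.append("}")

dot = "\n".join(lines)
print(dot)
```

Feed a file like this to Graphviz’s dot tool and you get back an image of the graph.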

Gephi can also produce pretty pictures, and they’re far prettier than what you get from OmniGraffle or Graphviz alone. I took the graph from one of our scarier boxes, which uses a LOT of exported resources, and popped it into Gephi. The nodes are colored by the number of dependencies on them. The massive cluster in the lower right is composed of Nagios check resources, which all depend on Class[nagios::server], so you can identify what’s going on there pretty easily.

Puppet graphs with gephi

Neat, huh?

The gory bits - converting manifests to a catalog

I’m going to gloss over how manifests are converted to catalogs, because Masterzen is so damn awesome that he already wrote posts on the parser and compiler.


This should explain a little about how Puppet stores resources, how things are ordered, why graphs are used to represent the system, and what difference it makes. At a later date I’ll go through and explain the actual API that the catalog provides.