Jonah Horowitz

@jonahhorowitz

Configuration Management is an Antipattern

Configuration Management is an Antipattern talk from Scale15x

Release Engineering

When I got my first job at a tech startup, this was literally our release procedure. I basically copied our code from CVS to production and restarted each webserver.

Eventually, we hired a few new Systems Administrators and moved the process into a bash script.

Now with my bash script, I could run a little ssh for loop, and push code out to all of the servers pretty quickly. This script eventually became >1000 lines of Perl with a web interface.

Server Installation

So, that was how we deployed code, how about how we managed server installs before configuration management?

It’s funny: before we had good online update mechanisms for our servers, you actually had a consistent build across your systems, but only because you never updated anything.

You’d think we’d gotten away from this but, in fact, here’s the more modern version for cloud infrastructure.

If you do this well, you can have every server in production running a slightly different version of your operating system packages.

Configuration Management

A few startups after that first e-commerce site, I started working at Looksmart with Nate Campi, who wrote a book on automating Linux. Looksmart was the first time I’d ever encountered a real configuration management system, and it was a revelation.

When I got there, Nate had automated about half of the infrastructure with cfengine, and over the course of the next couple years, we would get to the point where the infrastructure was 100% run by code. Every server in the entire company (not including the Oracle database servers, and more on that in a moment) could be reimaged in a couple of hours.

Eventually, I left that startup and went to another one. When I was interviewing, I always asked what kind of configuration management software they were running because I absolutely was not going back to another startup that wasn’t running automation.

At the company I eventually joined, one of the people interviewing me said they were running Chef and really happy with it, and another interviewer said they were running Puppet. That was probably a red flag, but I joined them anyway. It turns out they were running both, but over the course of a few months, we consolidated on Puppet.

You might ask why we chose Puppet over Chef, but it wasn’t a particularly complicated decision. We had more of the infrastructure already automated in Puppet, so there was less Chef code to replace.

If configuration management was so revolutionary and empowering, why shouldn’t you use it?

From what I’ve seen there are two modes of running config management. In the first mode, you’ve got an operations team that bottlenecks all changes in production. They’re the only ones with commit access to the configuration management repository, and they have to make all changes in production. This is the anti-DevOps way of doing things.

The other option is that you expect developers to run configuration management on the clusters that they are responsible for. This is the much more DevOps way of doing things, but it suffers from a different problem. Now you have to teach all your engineers the DSL of your configuration management software of choice, and depending on how you deploy your code to production, every developer now has the power to take down your whole system with a poorly written configuration change. I once had an engineer kill off everything owned by root — with the exception of init — on a 4000 node cluster — including sshd, rsyslogd, and most frustratingly crond — which prevented us from being able to fix the problem using our cron-triggered configuration management tool.

So, there is a way to fix that problem, of course: you just have a separate configuration management branch for every cluster you run in production, and restrict developers to running code on only the systems they manage. Now they can only shoot their own team in the foot. But great, now how do you manage the common infrastructure code? Maybe you use git submodules? It becomes a hairy mess really quickly.

And you still have another issue, which is out-of-sync configurations. Anyone who’s run configuration management at scale has run into this issue. At any given time there’s some percentage of your fleet that is not up to date. There are many reasons: not all the servers run the configuration management tool at the same time, or there are broken networks, buggy code, bad configuration pushes, your configuration server is down, or whatever.

To solve this problem, you end up writing a bunch of error catching/correcting code to handle all the ways your configuration management tool can fail. Then you write a monitoring alert that triggers when a server gets too far out of date, and no matter how hard you try, you’ll still have unpredictable bits of your infrastructure that aren’t covered by your configuration management or your error checking code.
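As a rough illustration of that kind of staleness alert, here is a minimal sketch in Python. The function name `stale_nodes`, the two-hour threshold, and the hostnames are all assumptions made up for the example; a real check would read last-run timestamps from whatever your configuration management tool reports.

```python
from datetime import datetime, timedelta, timezone


def stale_nodes(last_runs, now, max_age=timedelta(hours=2)):
    """Return the hosts whose last successful config run is older than max_age."""
    return sorted(host for host, ts in last_runs.items() if now - ts > max_age)


# Hypothetical fleet: hostname -> time of last successful configuration run.
now = datetime(2017, 3, 1, 12, 0, tzinfo=timezone.utc)
last_runs = {
    "web-01": now - timedelta(minutes=20),
    "web-02": now - timedelta(hours=5),
    "db-01": now - timedelta(days=3),
}

print(stale_nodes(last_runs, now))  # ['db-01', 'web-02']
```

Note what the check cannot tell you: a host that reports a recent run may still have drifted in ways the tool does not manage.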

See, configuration management promises that you’ll know the complete state of your infrastructure, but it never works that way.

Then there’s a problem that isn’t unique to configuration-managed environments, but that they enable. Everyone knows that one server. That one server that’s super important, but nobody has gotten around to automating yet. That one server that’s a single point of failure. That one server that Bob, who now works at Hooli, set up, and nobody knows how to rebuild. Yeah, that one server!

Still, none of this addresses the problem from the beginning of my talk, which is that release engineering still sucks. Configuration management doesn’t really work for release engineering. It can be tortured into the service of release engineering, but in most cases, you end up running some other tool on top of it. That doesn’t mean I haven’t tried. Update this release version for a gated set of the cluster then check that code in, let it run, and then un-gate another part of your infrastructure, oh, and try to automate those changes to your configuration management code, so you don’t screw it up when doing an emergency bug fix. It’s terrible.

What’s the alternative?

The alternative is immutable infrastructure. If you’re running in the cloud, this means baking your AMI with your application code already on it. See, imagine a world where you finish writing the code, you push it into git, and it gets built into an RPM or Debian package. That package is installed on a Base-AMI that was carefully crafted by your security and performance engineers, and then an AMI, with your software installed, along with its dependencies is pushed out to your AWS regions.

But you say to me, well, we’re not running in AWS. We have our own private cloud infrastructure. Well, you can do almost the same thing with Docker. In fact, Docker makes it even easier. Docker images are immutable to start with.

As a side note, don’t run configuration management inside your Docker images. Seriously. If you’re doing that, you’re doing it wrong.

Image Creation

Start with a base, or foundation, image. This is either your handcrafted, optimized image or just the default image from an upstream vendor. At Netflix, this was built by the performance engineering team, with input from the security team. Your base image should have the latest security updates as well as any base infrastructure packages that are run platform-wide, things like your monitoring packages or your service discovery. Now, you could use a configuration management tool to build your base image, but you could also use OS packages and a little Python to install and configure your base image.

Now, once you have that base image, you’ll probably want to canary it with smaller, less-critical applications before releasing it to the rest of the org. At Netflix, we built/promoted the base AMI every week, but we also had a way to push security updates on a faster release cycle when needed.

Once you have that base image, you install your application and its dependencies on the base image using your standard package manager (like apt-get or yum). If you have your dependencies configured in your package correctly, this is basically one step.

You compile that into a new application-specific image, push that to all your AWS regions, and voilà! Immutable infrastructure!
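To make the bake-then-publish flow concrete, here is a small Python model of it. This is only a sketch under stated assumptions: `Image`, `bake`, and `publish` are hypothetical names, nothing here calls a real cloud API, and the content-derived ID is an illustration (real AMI IDs are assigned by AWS, not computed from contents). The point it shows is that an image is a frozen record of "base plus packages" and that a change means a new image, not an edit.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class Image:
    base_id: str
    packages: tuple  # sorted (name, version) pairs

    @property
    def image_id(self):
        # Illustrative ID derived from contents; not how real AMI IDs work.
        payload = self.base_id + "|" + ",".join(f"{n}={v}" for n, v in self.packages)
        return "img-" + hashlib.sha256(payload.encode()).hexdigest()[:12]


def bake(base_id, packages):
    """Combine a base image with application packages into a new image."""
    return Image(base_id=base_id, packages=tuple(sorted(packages.items())))


def publish(image, regions):
    """Record the same image ID for every region it is pushed to."""
    return {region: image.image_id for region in regions}


app_v1 = bake("base-2017-03", {"myapp": "1.4.2", "libfoo": "2.0"})
app_v2 = bake("base-2017-03", {"myapp": "1.4.3", "libfoo": "2.0"})

print(app_v1.image_id != app_v2.image_id)  # True: a new version is a new image
print(publish(app_v2, ["us-east-1", "eu-west-1"]))
```

Trying to assign to `app_v1.base_id` raises `dataclasses.FrozenInstanceError`, which is the in-code analogue of never patching a running image.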

Tools

Let me just briefly talk about some of the tools that you need to make this work.

  • First, if you’re going to use OS packages, you need a quick and easy way to build them. Netflix uses Gradle for this. I recommend it.
  • Next, you need a system to build your images. Again, Netflix uses Aminator, but you could use Docker or Packer.
  • You need a deployment system like Spinnaker, Terraform or CloudFormation.
  • If you want to deploy the same images in test and prod, and you should, you need service discovery like Eureka, Zookeeper or even just internal ELBs.
  • Finally, and this is optional, but I want to mention that having a way of doing dynamic configuration with feature flags can allow quick changes without the need for a re-bake/redeploy.
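The last item, dynamic configuration with feature flags, can be sketched in a few lines of Python. The `FeatureFlags` class and the flag names are hypothetical, and the sketch assumes some external flag store pushes or is polled for overrides; it is not the API of any particular flag service.

```python
class FeatureFlags:
    """Defaults are baked into the image; overrides arrive at runtime."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)
        self._overrides = {}

    def update(self, overrides):
        # Called with the latest values from the (assumed) external flag store.
        self._overrides = dict(overrides)

    def is_enabled(self, name):
        # Runtime override wins; otherwise use the baked default; unknown flags are off.
        return bool(self._overrides.get(name, self._defaults.get(name, False)))


flags = FeatureFlags({"new_checkout": False, "dark_mode": True})
print(flags.is_enabled("new_checkout"))  # False (baked default)

flags.update({"new_checkout": True})     # flipped without a re-bake or redeploy
print(flags.is_enabled("new_checkout"))  # True
print(flags.is_enabled("dark_mode"))     # True (default still applies)
```

The image itself never changes; only the values it reads at runtime do, which keeps quick toggles separate from the bake pipeline.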

Benefits

It totally simplifies your operations. You no longer have to know the state of your currently running servers before releasing a new version. You no longer have to think about how to move from one state to another, and if your servers are broken, you don’t have to fix them (or log into them one at a time to restart crond). Have any of you spent hours or days trying to get the Puppet Augeas code to work? Immutable means you never have to look at Augeas again, and that’s worth it just for its own sake.

This enables continuous deployments because new code just goes through your pipeline and you don’t have to deal with old versions of libraries or dependencies or configuration that might have been left around.

You can quickly start up new instances of your software when you need to scale. I’ve seen config management environments where it takes 4 hours from when an instance first launches before it’s ready to take traffic. That’s probably a pathological case, but it can easily be an hour. It’s hard to use reactive autoscaling if you have to wait an hour for new instances to come up. It’s also hard to recover from failure. If one of the machines in your cluster dies, is killed by chaos monkey, or rebooted because your cloud provider kills the underlying instance, you need to be able to start up a new one quickly. If you think back to CS1, this is a lot like how we talk about optimization during compile time. You’re going to execute the code over and over, put the optimization in there at the stage that only runs once, and take advantage of the startup/run speed.
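The compile-time analogy is easy to put into numbers. The Python sketch below compares total setup time when every instance converges at boot against baking once and booting a finished image. The 60-minute figure comes from the "easily an hour" case above; the 30-minute bake, the 2-minute boot, and the 100 launches are assumptions picked purely for illustration.

```python
def converge_at_boot(launches, setup_minutes):
    """Every new instance pays the full configuration cost when it starts."""
    return launches * setup_minutes


def bake_once(launches, bake_minutes, boot_minutes):
    """Configuration cost is paid once at bake time; each launch only boots."""
    return bake_minutes + launches * boot_minutes


launches = 100  # assumed number of instance launches for one release

print(converge_at_boot(launches, setup_minutes=60))           # 6000
print(bake_once(launches, bake_minutes=30, boot_minutes=2))   # 230
```

More important than the totals is the per-instance latency: under these assumptions an autoscaling event waits 2 minutes for capacity instead of 60.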

In addition, your configuration is always in sync across your nodes, since they were all launched at the same time, from the same image — no more worrying about that one node where Chef crashed halfway through. You also don’t have to worry about cruft building up in dark corners of your systems. If one of your nodes is acting weird, just kill it and start a new one.

You deploy your same image to dev, test, and prod, so you can trust the systems to behave the same in each environment.

It’s easier to respond to security threats because you’re used to replacing all of your images in production, so all you have to do is update your base image, and run a new push. No need for kernel-upgrade reboots because your nodes boot from a clean/upgraded state. Also, in the event one of your nodes was compromised, you might limit the time an attacker can persist inside your network.

It makes multi-region operations easier because you run the same image everywhere.

And that one server, well that one server is going to be really obvious. In fact, if you’re running chaos monkey, or you’re relying on AWS to be your chaos monkey, you’re going to lose that one server sooner or later anyway.

Release Strategies

Moving to immutable allows you to take advantage of some cool release strategies. The first one is a rolling release, where nodes are replaced one at a time. This can be handy when you have state on your cluster that needs to be preserved.
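A rolling release can be sketched as a simple loop. In the Python below, `launch` and `healthy` are hypothetical callbacks standing in for your deployment system and health checks, and stopping at the first unhealthy replacement is one possible policy, not the only one.

```python
def rolling_release(nodes, new_version, launch, healthy):
    """Replace nodes one at a time; stop at the first unhealthy replacement.

    Returns (resulting_nodes, completed).
    """
    result = list(nodes)
    for i in range(len(result)):
        candidate = launch(new_version)
        if not healthy(candidate):
            return result, False  # remaining nodes keep the old version
        result[i] = candidate     # old node at position i is retired
    return result, True


# Toy stand-ins for a real deployment system and health check.
old = [{"version": "1.0"} for _ in range(3)]
new, completed = rolling_release(
    old, "1.1", launch=lambda v: {"version": v}, healthy=lambda n: True
)

print(completed)                    # True
print([n["version"] for n in new])  # ['1.1', '1.1', '1.1']
```

Because only one node changes at a time, the cluster keeps most of its capacity and state throughout the release.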

My favorite release strategy is the Blue/Green push. You have 100 nodes, you start up 100 new nodes, and then you move traffic to the new nodes. If you have issues, you can quickly move traffic back to the old nodes. After a reasonable window, say 1–3 hours, you shut down the old nodes.
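Here is a minimal Python sketch of that Blue/Green flow. The `launch` and `healthy` callbacks and the `live`/`standby` dictionary are hypothetical stand-ins for a real deployment system and load balancer; the sketch only models which group receives traffic.

```python
def blue_green(blue, new_version, launch, healthy):
    """Start a full green group, then switch traffic only if all of it is healthy."""
    green = [launch(new_version) for _ in blue]
    if all(healthy(node) for node in green):
        # Blue stays up as standby so traffic can be moved back quickly.
        return {"live": green, "standby": blue}
    # Green never took traffic; discard it and keep serving from blue.
    return {"live": blue, "standby": []}


def rollback(state):
    """Move traffic back to the standby group."""
    return {"live": state["standby"], "standby": state["live"]}


blue = [{"version": "1.0"} for _ in range(3)]
state = blue_green(blue, "1.1", launch=lambda v: {"version": v}, healthy=lambda n: True)
print(state["live"][0]["version"])  # 1.1

state = rollback(state)
print(state["live"][0]["version"])  # 1.0
```

Shutting down the old nodes after the safety window corresponds to emptying `standby`, after which rollback means another push rather than a traffic switch.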

Caveats

So, of course, there are a couple of caveats. If you are running your own bare-metal infrastructure, you still need to manage the base operating system, but this should be a small team, and you should make your base OS as small as possible. This reduces the attack surface externally, and it also reduces the risk of a bad change taking down your systems internally.

Databases

So, back at the beginning, I mentioned quickly that our CFEngine setup couldn’t reimage our database servers. Now, back then, we were running old-school Oracle on big Sun Hardware, but it’s a story I still hear from people running Mongo and Cassandra. Sure — we can run immutable, but not on our database nodes, and to that I say — you’re not trying hard enough. As long as the on-disk format is kept between db versions, you can keep your database on EBS (or mount-point in the case of Docker), and you now get the ability to dynamically relaunch your database using immutable images. It’s even possible using relational databases if you have scripted failover between your primary and standby instances. It is possible, I’ve seen it done.

Configuration management had its day. It did change the way we managed our infrastructure and allowed us to scale our infrastructure in ways we never had before. The advantages it brought us in reliability and consistency should not be understated, but it’s also a technology that, like the hand-crafted Perl scripts that came before it, whose time has come.

With the adoption of the cloud and application containers, we can now do better than configuration management; we can run with immutability. This will make infrastructure more consistent, more reliable, more secure and more scalable than ever before.
