At Checkly, we run our browser checks on AWS EC2 instances managed by Terraform. When shipping a new version, we don’t want to interrupt our service, so we need zero downtime deployments. Hashicorp has its own write-up on zero downtime upgrades, but it only introduces the Terraform configuration without much of the context, workflow, or other details needed to actually make this work in real life™.
This is the full lowdown of how we have done it in production for ~1.5 million Chrome-based browser checks since launch.
For those less initiated into “infrastructure as code” and “immutable infrastructure”, let’s look at the problem a bit closer. You will see that you have to build your app in a specific way and have some specific middleware (i.e. queues) in place to benefit from this approach. Skip this if you are a grizzled veteran.
You can chop this problem into a bunch of parts. Some are Terraform related, some are not, but they all need to be in place before you can pull this off without annoying your users.
For Checkly, the app in question can be defined as a “worker”. The workers process incoming requests based on a queue of work. More on this in the architecture section. For now, let’s look at what we want from our worker and from our worker deployment process:
Architecture problems
Deployment problems (solved by Terraform)
Operations problems
Some of these problems should be addressed during the rollout, while others should be addressed in your application architecture.
Some of the problems raised above can be tackled by following a typical fan-out / fan-in pattern or a master / worker pattern. This works especially well for the Checkly use-case because users do not interact directly with the workers.
In Checkly’s case, the architecture is as follows: workers pull check jobs from an SQS queue. When a worker successfully processes a message, it calls the done() callback, which deletes the message from the queue. When processing fails, the done() callback is never called or is called with an error, done(err). This triggers SQS-specific behaviour where the message becomes visible again in the queue and other workers can pick it up. This is key, as we are now free to kill a worker without missing any work. This is an SQS feature, but it comes with almost any queueing platform.

Applying this pattern solves our architecture-related problems thanks to the pattern’s inherent load balancing and decoupling attributes. Of course, this pattern also allows for pretty easy scaling. More messages === more workers.
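To make the done() semantics concrete, here is a minimal sketch of such a worker loop using the AWS SDK for Node.js. This is not Checkly’s actual launcher code: runBrowserCheck is a hypothetical stand-in for the real check execution, and the queue URL comes from the same WORK_QUEUE environment variable the user-data file below sets up.

// A minimal sketch, not Checkly's actual launcher code.
const AWS = require('aws-sdk')

const sqs = new AWS.SQS({ region: process.env.AWS_REGION })
const queueUrl = process.env.WORK_QUEUE

// Placeholder for the real browser check run.
const runBrowserCheck = async (job) => { /* drive Chrome, push results, etc. */ }

async function main () {
  while (true) {
    // Long polling keeps the loop cheap when the queue is empty.
    const { Messages = [] } = await sqs.receiveMessage({
      QueueUrl: queueUrl,
      MaxNumberOfMessages: 1,
      WaitTimeSeconds: 20
    }).promise()

    for (const message of Messages) {
      try {
        await runBrowserCheck(JSON.parse(message.Body))
        // The "done()" case: only delete the message once the work succeeded.
        await sqs.deleteMessage({
          QueueUrl: queueUrl,
          ReceiptHandle: message.ReceiptHandle
        }).promise()
      } catch (err) {
        // The "done(err)" case: leave the message alone. Once the visibility
        // timeout expires, SQS shows it to another worker.
        console.error('check failed, message will be retried', err)
      }
    }
  }
}

main().catch((err) => {
  console.error(err)
  process.exit(1)
})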
Moreover, this also allows for some measure of auto scaling based on load characteristics like the number of messages in a queue (and their relative age) or the 1m, 5m and 15m load averages of the EC2 instances. This matters because scaling up is easy, but scaling down without annoying users or impacting your service is a lot harder. Solving this issue for deployments also solves it for auto scaling. Two birds with one stone.
For anything even remotely stateful or interactive (e.g. API / web servers with session state, data sources, etc.) this pattern is pretty much a no-go without something like request draining, sticky-session-based routing or central session storage.
As mentioned earlier, Terraform provides you with two primitives to do zero downtime deployments.
1. The create_before_destroy flag in the lifecycle configuration block. Kinda speaks for itself: you can’t kill the existing servers before the new ones are up.
2. The local-exec or remote-exec provisioner. This executes a command; when it returns, Terraform continues its plan execution.

As you’ll find out, you need a bunch of other things to pull this off over multiple regions. Let’s look at an aws_instance configuration in a .tf file for Checkly.
// workers/module.tf
resource "aws_instance" "browser-check-worker" {ami = "${data.aws_ami.default.id}" // AWS Linux AMIinstance_type = "${var.instance_type}"count = "${var.count}"tags {Name = "browser-check-worker-${count.index}"Version = "0.9.0",Env = "${var.env}" // prod or test}user_data = "${var.user_data}" // User data pulls & starst the app
key_name = "checkly"
lifecycle {create_before_destroy = true}
// Every 5 seconds, check if the launcher.js process is up.
provisioner "remote-exec" {inline = ["until ps -ef | grep [l]auncher.js > /dev/null; do sleep 5; done"]
connection {
type = "ssh"
user = "ec2-user"
private\_key = "${file("~/.ssh/checkly.pem")}"
}
}}
Some take-aways from this file:
- The user_data variable points to a user-data.yml file that bootstraps the application. See more details below.
- The payoff is in using the remote-exec provisioner (which uses the SSH key). It checks every 5 seconds if the launcher.js process is running. Note we use the grep [l]auncher.js syntax to exclude the grep command itself from the process listing. Not doing this would make the command return instantly and defeat the whole purpose.
Admittedly, this is fairly simplistic, but for our use case it does exactly what is needed. The existence of the launcher process means we have our code running and it is ready to read new messages from the SQS queue.
To fully grasp this, we need to look at a user-data file.
#cloud-config
packages:
  - docker

write_files:
  - path: /root/.profile
    owner: root:root
    permissions: '0644'
    content: |
      if [ "$BASH" ]; then
        if [ -f ~/.bashrc ]; then
          . ~/.bashrc
        fi
      fi
      export NODE_ENV=production
      export AWS_REGION=ap-south-1
      export WORK_QUEUE=https://sqs.ap-south-1.amazonaws.com/xxxx/checks
      export RESULTS_QUEUE=https://sqs.ap-south-1.amazonaws.com/xxxx/results

runcmd:
  - service docker start
  - [., /root/.profile]
  - [docker, login, -u, checkly, -p, "pwd"]
  - [docker, pull, "checkly/browser-checks-launcher:latest"]
  - . /root/.profile && docker run -d -e NODE_ENV=$NODE_ENV ... checkly/browser-checks-launcher:latest
We make the docker package a requirement. The AWS AMI we use has Docker preinstalled, but who knows… We write a .profile file that contains the necessary environment variables our workers need to operate, like the addresses of the two queues it communicates with, what region it is serving, and in what environment it is working (production or test). Lastly, we docker run the image, passing in all the environment variables.

Remember, none of this marks our new instance as “finished” from a Terraform perspective. The remote-exec provisioner only returns after the Docker container is fully running and has started the relevant Node process.
The result looks as follows: Terraform first creates the new instance, waits until the remote-exec provisioner sees the launcher process, and only then destroys the old instance.
Note that this process is fairly generic. You can do the same thing for any Dockerized app or any Ruby, Python, Java, whatever app.
The AWS multi region Terraform configuration is very specific to how AWS manages naming, resources, access etc. per region. We also have Terraform configurations for Digital Ocean and they make it a lot simpler to pull this off. We make use of the Terraform strategy described in this blog post.
The main thing to grasp is that for each AWS region, you create a module in your main.tf
file, and reference a template using the source
attribute.
// main.tf
module "workers-us-east-1" {source = "workers"region = "us-east-1"count = 1user_data = "${file("user-data-us-east-1.yml")}"env = "prod"}
module "workers-us-west-1" {source = "workers"region = "us-west-1"count = 3user_data = "${file("user-data-us-west-1.yml")}"env = "prod"}
This leverages Terraform’s module hierarchy and allows you to fly in different variables for different regions. More importantly, it enables you to deploy to multiple regions with one command.
But why create new instances anyway? The worker is published as a Docker container, so can’t we just pull a new container, cycle the old one and be done with it? Yes, that works. We use it during development all the time. However, for production we want to be sure configuration hasn’t drifted due to manual intervention.
After setting all of this up, how do we release a new version of our worker? For Checkly, the steps are as follows:
First, we build a new Docker container. Tag it as latest and push it to our private repo. Nothing special here.
Secondly, we update the version in our module.tf
file.
We then use the Terraform taint
command to force a create/destroy cycle. Why? Because Terraform has no way of knowing that we want to pull a new container. Just bumping the version is not sufficient to trigger a replacement of the EC2 instance.
terraform taint -module=workers-us-west-1 aws_instance.browser-check-worker
Notice that the -module
targets an AWS region as per the module declarations in the main.tf
file. From here, it is a straightforward terraform plan
and/or terraform apply
and ✨ behold the zero downtime magic. ✨
It doesn’t take a genius to see that this can be turned into a script that runs on a CI/CD platform pretty easily. It follows the general pattern of build → push → taint → plan → apply, where each stage can break and return control to the operator. Your CI/CD platform will probably shoot you an email when that happens.
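As an illustration only (not our actual pipeline), a Node.js sketch of that script could look like this. The module and image names match the examples above, and execSync throws on any non-zero exit code, which is exactly the “break and return control” behaviour we want.

const { execSync } = require('child_process')

const image = 'checkly/browser-checks-launcher:latest'
const modules = ['workers-us-east-1', 'workers-us-west-1'] // module names from main.tf

// Run a command, streaming its output. execSync throws on a non-zero exit
// code, so any failing stage aborts the deploy and alerts the operator.
const run = (cmd) => execSync(cmd, { stdio: 'inherit' })

// 1. Build and push a fresh container image.
run(`docker build -t ${image} .`)
run(`docker push ${image}`)

// 2. Taint the workers in every region so Terraform schedules a
//    create-before-destroy cycle for them.
for (const mod of modules) {
  run(`terraform taint -module=${mod} aws_instance.browser-check-worker`)
}

// 3. Plan and apply.
run('terraform plan')
run('terraform apply -auto-approve')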
All of the above might fail. Reasons for failure might be as varied as AWS changing some API (breaking Terraform), your own buggy code, or it just being Friday afternoon. In general, failures fall into two camps:
Note that this only works because our code explicitly calls the AppOptics API on each run, together with some basic details on what region the worker is running in, using the following code.
const axios = require('axios')
axios.defaults.baseURL = `https://${config.appOptics.apiToken}@api.appoptics.com/v1`
const namespace = 'checkly'
const trackRunCount = function () {
  const payload = {
    tags: {
      region: process.env.AWS_REGION || 'local'
    },
    measurements: [{
      name: `${namespace}.browser-check-worker.count`,
      value: 1
    }]
  }
  // Post the measurement to the AppOptics measurements endpoint so we can
  // count worker runs per region.
  return axios.post('/measurements', payload)
}
This is, again, specific to our situation as users don’t interact with the workers directly. In a typical client/web server scenario the number of 500 errors, or the lack of 200 response codes, could function as a similar trigger. In the end, you need to establish that your app is processing user requests successfully, regardless of whether your deployment was successful.
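As a hedged sketch of such a trigger (the health check URL is purely hypothetical), a post-deploy verification step could simply poll an endpoint until it returns a 200:

const axios = require('axios')

// Hypothetical post-deploy smoke test: poll an endpoint until it returns 200,
// or fail the pipeline if it never does within the time budget.
async function verifyDeploy (url, attempts = 12, intervalMs = 5000) {
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await axios.get(url, { timeout: 3000 })
      if (res.status === 200) return true
    } catch (err) {
      // Ignore and retry – the new instances may still be booting.
    }
    await new Promise(resolve => setTimeout(resolve, intervalMs))
  }
  throw new Error(`deployment verification failed for ${url}`)
}

verifyDeploy('https://app.example.com/health').catch(() => process.exit(1))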
Terraform also has providers for a ton of monitoring services that you can hook into your deployment routine. If your particular monitoring solution is among them, use it.
Doing zero downtime deployments with Terraform is a bit more involved than just using the right Terraform commands and configuration. Takeaways are:
Originally published at checklyhq.com.