Aaron Kalair

@AaronKalair

Working with the Docker build cache to autoscale our Jenkins nodes

We have a fairly standard Jenkins setup: a primary node and a whole bunch of secondaries that most of the job steps get farmed out to.

These secondary nodes sit doing nothing overnight so we thought it would be a good idea to scale them down in the evening and then back up in the morning so we can use the money we’ve saved to run more secondaries during the day when they’re in demand.

We run various jobs on our Jenkins infrastructure but the most common job is for building and testing our monolith on every push to a PR and merge into master.

This basically involves doing a git checkout of the branch (into existing workspaces on the primary node, so we don’t have to do a full initial checkout), building a Docker image that contains all of the dependencies and the version of the codebase at the time the job was run, and then running the tests inside the resulting container (on any of our secondary nodes, so we can run lots of jobs in parallel; files are transferred between nodes using Jenkins stashes).

Building this Docker image from scratch takes around 10 minutes, so we rely on the Docker image cache on these nodes to reduce our build times, placing the step which normally invalidates the cache (copying in the updated codebase) at the end of the Dockerfile.

The Dockerfile looks something like:

FROM ubuntu:14.04
# Things like apt-get installs, these take a decent amount of time
<some bash commands>
# These live in our main codebase along side the code and basically never change
<copy in some config files>
# There's a fair few of these, they take a while but rarely change
<copy some requirements.txt files>
<some pip installs>
# This is where we expect the cache to be invalidated, and copying in the files doesn't take long
<copy in the version of the codebase that's being built>
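Filled in with made-up package and path names (the real file is our monolith's, so everything concrete below is invented for illustration), that layering looks something like:

```dockerfile
FROM ubuntu:14.04
# Expensive, rarely-changing setup runs first so it stays cached
RUN apt-get update && apt-get install -y build-essential python-dev python-pip
# Config files that live alongside the code and basically never change
COPY config/ /srv/app/config/
# Dependency manifests and installs: slow, but only change when requirements do
COPY requirements.txt /srv/app/
RUN pip install -r /srv/app/requirements.txt
# The codebase itself changes on every build, so it goes last
COPY . /srv/app/
```

The ordering is the whole trick: each instruction's cache entry survives only if every instruction above it also hit the cache, so the volatile step has to sit at the bottom.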

So if we want to autoscale our secondary nodes we need to make sure that they have the image cache before jobs start running.

Attempt #1 was to simply pull down a copy of the image from our registry in the CloudFormation user data when the node comes up.

This failed because Docker won’t use images pulled from a registry as part of the build cache.

This is considered a feature — https://github.com/moby/moby/issues/20316

Attempt #2 was to replace pulling the Docker image with cloning our repository into /tmp and building the Docker image from that version of the codebase when the node first comes up. This would give us a copy of the Docker image from a recent version of the codebase, along with the build cache we needed to speed up subsequent builds.

This failed: during the first run of the Jenkins job the cache was invalidated at the first COPY step, when we copy over the config files, even though their contents hadn’t changed.

This was slightly confusing, because I thought that Docker only invalidated the cache at a COPY line if the file’s contents had changed, and they hadn’t. Some digging around revealed that Docker also uses the mtime value on a file when considering whether it has changed or not.

https://github.com/moby/moby/issues/4351#issuecomment-76222745
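The distinction is easy to see outside Docker with a quick shell check: two files whose bytes are identical but whose mtimes differ (filenames here are just for illustration):

```shell
#!/bin/sh
# Create two files with identical contents but different mtimes.
echo 'hello' > first
sleep 1
echo 'hello' > second   # same bytes, written a second later
# A content comparison says the files are the same ...
cmp -s first second && echo 'contents identical'
# ... but the mtimes (which the Docker version we ran also compared) differ.
[ "$(stat -c %Y first)" != "$(stat -c %Y second)" ] && echo 'mtimes differ'
```

This prints both 'contents identical' and 'mtimes differ': a content hash sees no change, but an mtime comparison does.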

So what was happening was that the git checkout of the codebase in the Jenkins job created files with different mtimes to those of the files from our checkout of the codebase into /tmp at node creation time, and so Docker considered the config files we copy in early on to have changed and invalidated the cache earlier than we wanted.

We didn’t see this issue before because we do the git checkout part of the job on our primary node, into workspaces which are never deleted; after an initial checkout these files are never touched, so Docker never considers the early COPY steps to have been invalidated.

We can replicate this behaviour with a simple example:

We have a project with two files: a Dockerfile and a text file

vagrant@vagrant-ubuntu-trusty-64:~$ ls
Dockerfile test

The Dockerfile just copies in our file

vagrant@vagrant-ubuntu-trusty-64:~$ cat Dockerfile
from ubuntu:16.04
COPY test /srv/test

And let’s take a note of the mtime of our test file for later on

vagrant@vagrant-ubuntu-trusty-64:~$ stat test
File: ‘test’
Size: 0 Blocks: 0 IO Block: 4096 regular empty file
Device: 801h/2049d Inode: 140134 Links: 1
Access: (0664/-rw-rw-r--) Uid: ( 1000/ vagrant) Gid: ( 1000/ vagrant)
Access: 2017-12-31 18:03:33.995881878 +0000
Modify: 2017-12-31 18:03:33.995881878 +0000
Change: 2017-12-31 18:03:33.995881878 +0000
Birth: -

Finally, if we build it

vagrant@vagrant-ubuntu-trusty-64:~$ sudo docker build .
Sending build context to Docker daemon 13.82 kB
Sending build context to Docker daemon
Step 0 : FROM ubuntu:16.04
16.04: Pulling from ubuntu
Step 1 : COPY test /srv/test
---> 3be8d8c094bf

We don’t have that step cached and so Docker just runs it.

If we modify test

vagrant@vagrant-ubuntu-trusty-64:~$ echo 'hello' > test
vagrant@vagrant-ubuntu-trusty-64:~$ sudo docker build .
Sending build context to Docker daemon 14.34 kB
Sending build context to Docker daemon
Step 0 : FROM ubuntu:16.04
---> c1ea3b5d13dd
Step 1 : COPY test /srv/test
---> 75bb69e528c3
Removing intermediate container 6771d1e1f191
Successfully built 75bb69e528c3

Docker picks up the change and invalidates the cache layer, nothing surprising here.

Now what if test came from a git repo?

I bundled these two files into a git repo, pushed them up, then pulled the repo down to a separate folder

vagrant@vagrant-ubuntu-trusty-64:~$ mkdir git
vagrant@vagrant-ubuntu-trusty-64:~$ cd git/
vagrant@vagrant-ubuntu-trusty-64:~/git$ git pull git@github.com:AaronKalair/test.git

And does Docker use the cache we created with the earlier docker build for the file?

vagrant@vagrant-ubuntu-trusty-64:~/git/test$ sudo docker build .
Sending build context to Docker daemon 47.62 kB
Sending build context to Docker daemon
Step 0 : FROM ubuntu:16.04
---> c1ea3b5d13dd
Step 1 : COPY test /srv/test
---> f4fe0a287838
Removing intermediate container 20c77a3dee81
Successfully built f4fe0a287838

Nope!

Are the files identical?

The one from the git clone

vagrant@vagrant-ubuntu-trusty-64:~/git/test$ md5sum test
a10edbbb8f28f8e98ee6b649ea2556f4 test

The original

vagrant@vagrant-ubuntu-trusty-64:~$ md5sum test
a10edbbb8f28f8e98ee6b649ea2556f4 test

Yep, they’re identical.

What about the mtime?

vagrant@vagrant-ubuntu-trusty-64:~/git/test$ stat test
File: ‘test’
Size: 7 Blocks: 8 IO Block: 4096 regular file
Device: 801h/2049d Inode: 262341 Links: 1
Access: (0775/-rwxrwxr-x) Uid: ( 1000/ vagrant) Gid: ( 1000/ vagrant)
Access: 2017-12-31 19:37:05.348272796 +0000
Modify: 2017-12-31 19:37:05.348272796 +0000
Change: 2017-12-31 19:37:05.348272796 +0000
Birth: -

Nope, it’s changed (it was 18:03 originally); the git pull sets the modify, access and change times to the time of the pull.

What about existing files? Does a git pull affect those?

If we add another file to the project, push it up …

vagrant@vagrant-ubuntu-trusty-64:~$ touch test_test
vagrant@vagrant-ubuntu-trusty-64:~$ git add test_test
vagrant@vagrant-ubuntu-trusty-64:~$ git commit
vagrant@vagrant-ubuntu-trusty-64:~$ git push origin master

and then pull it back down into the other git checkout …

vagrant@vagrant-ubuntu-trusty-64:~$ cd git/test/
vagrant@vagrant-ubuntu-trusty-64:~/git/test$ git pull
vagrant@vagrant-ubuntu-trusty-64:~/git/test$ stat test
File: ‘test’
Size: 7 Blocks: 8 IO Block: 4096 regular file
Device: 801h/2049d Inode: 262341 Links: 1
Access: (0775/-rwxrwxr-x) Uid: ( 1000/ vagrant) Gid: ( 1000/ vagrant)
Access: 2017-12-31 19:40:05.032278718 +0000
Modify: 2017-12-31 19:37:05.348272796 +0000
Change: 2017-12-31 19:37:05.348272796 +0000
Birth: -
vagrant@vagrant-ubuntu-trusty-64:~/git/test$ stat test_test
File: ‘test_test’
Size: 0 Blocks: 0 IO Block: 4096 regular empty file
Device: 801h/2049d Inode: 262359 Links: 1
Access: (0664/-rw-rw-r--) Uid: ( 1000/ vagrant) Gid: ( 1000/ vagrant)
Access: 2017-12-31 20:36:22.916378925 +0000
Modify: 2017-12-31 20:36:22.916378925 +0000
Change: 2017-12-31 20:36:22.916378925 +0000
Birth: -

The existing test file is untouched, and the new file appears with all of its timestamps set to the time of the git pull.
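For completeness, if we had wanted to keep building from fresh checkouts, one possible workaround (sketched here on a throwaway repo; in a real job you’d run just the loop inside the checkout’s work tree, and the paths and names below are invented) is to reset each tracked file’s mtime to its last commit time after checkout, so every node ends up with identical timestamps:

```shell
#!/bin/sh
set -e
# Build a throwaway repo to demonstrate on.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email ci@example.com
git config user.name ci
echo 'hello' > test
git add test
git commit -qm 'add test'
sleep 1
touch test                       # simulate a checkout bumping the mtime
# The workaround: set each tracked file's mtime to its last commit time.
# (Skips filenames containing spaces/newlines, for brevity.)
git ls-files | while read -r f; do
  ts=$(git log -1 --format=%ct -- "$f")
  touch -d "@$ts" "$f"
done
[ "$(stat -c %Y test)" = "$(git log -1 --format=%ct -- test)" ] && echo 'mtime matches commit time'
```

Since commit times are the same on every machine, two checkouts of the same commit would then agree on mtimes. We didn’t go this route, though.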

OK, so that explains the behaviour we’re seeing and why we’ve never had this problem before, but we still can’t autoscale our secondary Jenkins nodes without fixing the caching issue.

So attempt #3, which turned out to be successful, was to:

Create a base image that we use in the FROM at the start of the main Dockerfile, which has already run the expensive steps, and have that pulled down to the nodes when they spin up.

This would then effectively replicate having the expensive steps in the build cache.

So we have a nightly build from a Dockerfile that looks like:

# NIGHTLY IMAGE
FROM ubuntu:14.04
<some bash commands>
<copy in some config files>
<copy some requirements.txt files>
<some pip installs>

With our Jenkins job now using a Dockerfile that looks like:

FROM NIGHTLY_IMAGE
<copy in some config files>
<copy some requirements.txt files>
<some pip installs>
<copy in the version of the codebase that's being built>

Now all the expensive commands that we wanted cached (the initial bash commands and pip installs) are in the nightly image, which we pull to an instance when it’s created and which Jenkins uses in the FROM line.

The repeated pip installs in the Dockerfile Jenkins uses for the job ensure that if the requirements change before the next nightly build, those changes are reflected in the built image.

If the nightly image build uses versions of the files with different mtimes, it doesn’t matter, as a pip install with the requirements already satisfied only takes a few seconds.

The nightly image is built as a Jenkins job every night at 2am, and cron jobs on the secondaries, as well as autoscaling operations, ensure they have the latest nightly image every day.
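As a rough sketch, the node-side refresh can be as simple as a cron entry like this (the file path, registry and image names here are made up):

```shell
# /etc/cron.d/pull-nightly-image (hypothetical)
# Refresh the cached base image each morning before the build jobs start
30 5 * * * root docker pull registry.example.com/monolith-base:nightly
```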

And there we have it, we now spend less money and have more Jenkins capacity during the day when we need it!

Follow me on Twitter @AaronKalair
