Persist ~ 70% faster; restore ~ 80% faster, your mileage may vary
A few days ago, Andrew Stiegmann commented on a blog post of mine where I shared how we automate our release process with CircleCI. Andrew’s comment can be summarized with “Hey, is there any reason you use CircleCI cache instead of a workspace?”
I read up on a CircleCI blog post that explains the difference between a cache and a workspace. Their diagram does a great job explaining all that:
Our CircleCI workflow consists of five jobs, each of which needs access to node_modules and a bunch of generated files in dist folders. Our “build job” as outlined in the diagram is where we install all npm packages and generate the files in the dist folders. We use a monorepo (more about that here), which results in a lot of packages hoisted to the root.
Migrate from a cache to a workspace
The main change here is in the .circleci/config.yml file. First, the save_cache directive needs to be replaced with persist_to_workspace. Likewise, any occurrence of restore_cache needs to be replaced with an
In our case, that change alone improved persistence of data by about 60% (from 67 seconds writing to the cache to 27 seconds persisting to a workspace), while restoring it in a subsequent job dropped from 35 seconds with a cache to 12 seconds with a workspace (a roughly 65% performance gain).
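As a rough illustration of that swap, here is a minimal sketch of a two-job workflow. It is not our actual config: the job names, the Docker image, the yarn scripts, and the choice to persist the whole working directory are all assumptions made for the example. It also assumes that attach_workspace, CircleCI's counterpart to persist_to_workspace, is the step that takes the place of restore_cache.

```yaml
# Sketch only: job names, image, and scripts are hypothetical.
version: 2.1

jobs:
  build:
    docker:
      - image: cimg/node:lts   # assumed image, use whatever your jobs run on
    steps:
      - checkout
      - run: yarn install
      - run: yarn build        # hypothetical script that generates the dist folders
      # Takes the place of save_cache: hand the files to downstream jobs.
      - persist_to_workspace:
          root: .
          paths:
            - .
  test:
    docker:
      - image: cimg/node:lts
    steps:
      # Takes the place of restore_cache: unpack what the build job persisted.
      - attach_workspace:
          at: .
      - run: yarn test         # hypothetical script

workflows:
  release:
    jobs:
      - build
      - test:
          requires:
            - build
```

Unlike a cache, a workspace needs no key: it only lives for the duration of one workflow run, and a job can attach it only after the job that persisted it has finished, hence the requires entry.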
Bonus improvement — prune
So far so good, but while I was at it, I dug a bit deeper. For a while now I’ve had an eye on our node_modules directory size… 🙀 Have you ever checked yours? If so, I’m quite sure you’re with me here. If not, go ahead and check yours, then come back here. I’ll wait.
Alright, now that all readers understand what I mean, let’s continue. In our case, the size of node_modules is 553 MB. Does it have to be though…? Definitely not! There’s no need to have *.md files, documentation assets, tests, temporary files, etc. All these files have to be compressed and decompressed when we share them across jobs on CircleCI, regardless of whether we use a cache or a workspace.
I’m aware of two options, so I implemented both and compared them to make an informed decision on which one to choose.
TJ Holowaychuk built node-prune for one of his products and open-sourced it at https://github.com/tj/node-prune. It’s a tiny Go command that can be installed in a single line. In our case, it takes 3 seconds to install and run the command. The output of the command looks as follows:
That’s 136 MB of unnecessary files dropped from the node_modules directory. That’s also 136 MB less data to be compressed and decompressed when passing data from one CircleCI job to the next.
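For reference, a pruning step could look roughly like the fragment below. Treat it as a sketch: the install one-liner is an assumption based on the project's README at the time of writing, so check https://github.com/tj/node-prune for the currently recommended command, and the step names are made up for the example.

```yaml
# Fragment of a job's steps in .circleci/config.yml (sketch only).
steps:
  - checkout
  - run: yarn install
  - run:
      name: Prune node_modules   # hypothetical step name
      command: |
        # Assumed install command; verify it against the node-prune README.
        curl -sf https://gobinaries.com/tj/node-prune | sh
        # Without arguments, node-prune works on ./node_modules.
        node-prune
  # Persist only after pruning so the smaller directory is what gets archived.
  - persist_to_workspace:
      root: .
      paths:
        - .
```

The ordering is the point here: pruning has to happen before the persist step, otherwise the removed files are still compressed and shipped to the next job.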
Now, persisting data to a workspace takes 20 seconds, a roughly 70% improvement compared to the original implementation of using a cache. Restoring that data takes 7 seconds, which is 80% faster than what we had originally.
I wasn’t satisfied with only one option to prune node_modules. Whenever possible, it’s a good idea to have a few extra data points.
The second option is yarn autoclean, as documented at https://yarnpkg.com/lang/en/docs/cli/autoclean/. This command cleans up node_modules as part of the yarn install command.
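A minimal sketch of how that can be wired into a job, assuming Yarn 1.x (where autoclean is available) and a hypothetical step name. In a real project you would more likely run yarn autoclean --init once locally and commit the resulting .yarnclean file rather than generate it on every build.

```yaml
# Fragment of a job's steps in .circleci/config.yml (sketch only, Yarn 1.x assumed).
steps:
  - checkout
  - run:
      name: Install dependencies with autoclean   # hypothetical step name
      command: |
        # Creates .yarnclean with Yarn's default patterns if it does not exist yet.
        yarn autoclean --init
        # With .yarnclean present, the clean-up runs as part of the install.
        yarn install
```

The patterns in .yarnclean decide what gets removed, so the savings depend on how aggressively that file is configured.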
Running that was slightly slower than node-prune. yarn install took about 20 seconds longer (because it runs autoclean), which is 17 seconds more than node-prune needs to get installed and prune the dependencies. Persisting the repository to a workspace took 23 seconds (about 65% faster than with a cache), and restoring the workspace was “only” about 70% faster.
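As a quick sanity check on the percentages, they all follow from the timings quoted in this post, with the cache numbers (67 seconds to save, 35 seconds to restore) as the baseline. The restore time for yarn autoclean is left out because only its percentage is given above.

```latex
\begin{align*}
\text{improvement} &= \frac{t_{\text{cache}} - t_{\text{new}}}{t_{\text{cache}}} \\[4pt]
\text{workspace, persist} &: \frac{67 - 27}{67} \approx 0.60 \\
\text{workspace, restore} &: \frac{35 - 12}{35} \approx 0.66 \\
\text{workspace + node-prune, persist} &: \frac{67 - 20}{67} \approx 0.70 \\
\text{workspace + node-prune, restore} &: \frac{35 - 7}{35} = 0.80 \\
\text{workspace + yarn autoclean, persist} &: \frac{67 - 23}{67} \approx 0.66
\end{align*}
```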
This table taken from the PR I opened at work summarizes the results:
I completely dropped the cache for now, mainly to keep the PR small and manageable. I’m playing with some thoughts to see whether a cache could help speed up multiple workflow runs.
In our case, we save 2.6 minutes for an entire workflow on CircleCI.
If you have similar results, or a completely different experience, I’d love to hear from you. Please leave a comment or clap if you found this interesting (those claps are encouraging to write and share more 😀).