A few days ago, Andrew Stiegmann commented on a blog post of mine where I shared how we automate our release process with CircleCI. Andrew’s comment can be summarized as “Hey, is there any reason you use CircleCI cache instead of a workspace?”
I read a CircleCI blog post that explains the difference between a cache and a workspace. Their diagram does a great job of explaining it:
CircleCI cache vs workspace. Source: CircleCI blog post
Our CircleCI workflow consists of five jobs, each of which needs access to `node_modules` and a bunch of generated files in `dist` folders. Our “build job”, as outlined in the diagram, is where we install all npm packages and generate the files in the `dist` folders. We use a monorepo (more about that here), which results in a lot of packages hoisted to the root `node_modules` directory.
Switching from a cache to a workspace mainly affects the `.circleci/config.yml` file. First, the [save_cache](https://circleci.com/docs/2.0/configuration-reference/#save_cache) directive needs to be replaced with [persist_to_workspace](https://circleci.com/docs/2.0/configuration-reference/#persist_to_workspace). Likewise, any occurrence of [restore_cache](https://circleci.com/docs/2.0/configuration-reference/#restore_cache) needs to be replaced with an [attach_workspace](https://circleci.com/docs/2.0/configuration-reference/#attach_workspace) directive.
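As a rough sketch of what that swap can look like, here is a minimal two-job config. This is not our actual config: the Docker image, the job names, the `yarn build` and `yarn test` scripts, and the `packages/*/dist` path are all assumptions made for illustration. Note that the parameters differ from the cache directives: `persist_to_workspace` takes a `root` and `paths`, and `attach_workspace` takes an `at` directory instead of a cache key.

```yaml
version: 2
jobs:
  build:
    docker:
      - image: circleci/node:10   # placeholder image, pick your own
    steps:
      - checkout
      - run: yarn install --frozen-lockfile
      - run: yarn build           # hypothetical script that generates the dist folders
      # Previously: save_cache with a key and paths
      - persist_to_workspace:
          root: .                 # relative to the working directory
          paths:
            - node_modules
            - packages/*/dist     # assumed monorepo layout
  test:
    docker:
      - image: circleci/node:10
    steps:
      - checkout
      # Previously: restore_cache with the matching key
      - attach_workspace:
          at: .
      - run: yarn test            # hypothetical script

workflows:
  version: 2
  build_and_test:
    jobs:
      - build
      - test:
          requires:
            - build               # the workspace only exists once build has finished
```

The `requires` entry matters here: a workspace is passed downstream within a single workflow run, so a job can only attach what an upstream job has already persisted.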
In our case, that change alone made persisting the data about 60% faster (from 67 seconds writing to the cache to 27 seconds persisting to a workspace), while restoring it in a subsequent job dropped from 35 seconds with a cache to 12 seconds with a workspace (roughly 65% faster).
node_modules
So far so good, but while I was at it, I dug a bit deeper. For a while now I’ve had an eye on the size of our `node_modules` directory… 🙀 Have you ever checked yours? If so, I’m quite sure you’re with me here. If not, go ahead and check yours, then come back here. I’ll wait.
Heaviest objects in the universe — based on estimates
Alright, now that all readers understand what I mean, let’s continue. In our case, the size of `node_modules` is 553 MB. Does it have to be, though? Definitely not! There’s no need to keep `*.md` files, documentation assets, tests, temporary files, etc. All of these files have to be compressed and decompressed when we share them across jobs on CircleCI, regardless of whether we use a cache or a workspace.
I’m aware of two options for pruning `node_modules`, so I implemented both and compared them to make an informed decision about which one to choose.
node-prune
TJ Holowaychuk built node-prune for one of his products and open-sourced it at https://github.com/tj/node-prune. It’s a tiny Go command-line tool that can be installed with a single line. In our case, installing and running it takes 3 seconds. The output of the command looks as follows:
node-prune doing its job on CircleCI
That’s 136 MB of unnecessary files dropped from the `node_modules` directory. That’s also 136 MB less data to be compressed and decompressed when passing data from one CircleCI job to the next.
Now, persisting data to a workspace takes 20 seconds, a roughly 70% improvement over the original cache-based implementation. Restoring that data takes 7 seconds, 80% faster than what we had originally.
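For illustration, the pruning step could sit between installing dependencies and persisting the workspace, roughly like this. It is a sketch under assumptions: the step names are made up, the installer is deliberately left out (use the one-liner from the project’s README), and the two `du` calls are only there to log the size before and after.

```yaml
steps:
  - checkout
  - run: yarn install --frozen-lockfile
  # Assumption: node-prune was installed in an earlier step
  # (see the project's README) and is available on the PATH.
  - run:
      name: Prune node_modules
      command: |
        du -sh node_modules   # size before pruning
        node-prune            # prunes ./node_modules by default
        du -sh node_modules   # size after pruning
  - persist_to_workspace:
      root: .
      paths:
        - node_modules
```

Pruning before `persist_to_workspace` is the point of the exercise: whatever is removed here never has to be compressed, uploaded, downloaded, or decompressed by later jobs.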
I wasn’t satisfied with having only one option to prune `node_modules`. Whenever possible, it’s a good idea to have a few extra data points.
I found `yarn autoclean`, as documented at https://yarnpkg.com/lang/en/docs/cli/autoclean/. This command cleans up `node_modules` as part of the `yarn install` command.
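According to the Yarn documentation linked above, autoclean is disabled by default and switches on once a `.yarnclean` file exists in the package. The sketch below assumes that file was generated once locally with `yarn autoclean --init` and committed to the repository; the step name is made up.

```yaml
steps:
  - checkout
  - run:
      name: Install dependencies
      # Assumes a committed .yarnclean file at the repository root.
      # With that file present, yarn performs the clean-up right after
      # the install, so no separate pruning step is needed.
      command: yarn install --frozen-lockfile
```

Because the clean-up is driven by the committed `.yarnclean` file, it also runs on every developer machine after an install, not only on CI.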
Running that resulted in slightly slower results compared to `node-prune`. `yarn install` took about 20 seconds longer (because it runs `autoclean`), which is 17 seconds more than `node-prune` needs to get installed and prune the dependencies. Persisting the repository to a workspace took 23 seconds (about 65% faster than with a cache), and restoring the workspace was “only” about 70% faster.
This table, taken from the PR I opened at work, summarizes the results:
CircleCI cache vs workspace
I completely dropped the cache for now, mainly to keep the PR small and manageable. I’m still considering whether a cache could help speed things up across multiple workflow runs.
In our case, we save 2.6 minutes for an entire workflow on CircleCI.
If you have similar results, or a completely different experience, I’d love to hear from you. Please leave a comment or clap if you found this interesting (those claps are encouraging to write and share more 😀).