We think of bugs as being in code but that’s just the end of the story. Bugs are a human problem. You fix one bug in code but you prevent future bugs by helping humans to work better.
That’s the principle behind much of the DevOps movement. Start testing as early in the dev cycle as possible and then keep testing until the code is in production. We see it in practice as continuous integration (CI): commit, test, and merge code frequently rather than building up huge unmergeable branches.
It’s likely that you’re already practising CI but there are always one or two things you could be doing better.
We’ve all been there. The test suite passes. Your local branch is green. Your colleague has reviewed your pull request and all looks good. Unbeknownst to either of you, someone in another team has been working on an enormous change to the same code you’ve been working on.
Your local test suite passes. That other person’s test suite passes. You both merge your changes within a few minutes of each other and master ends up broken. You start to unpick the damage and you wonder how you came to be at this point in your life.
In a world where we test extensively, run code reviews, and put faith in our own ability to do the right thing, there’s still an argument that merging directly to master should remain sacred. After all, breaking master puts a block on the whole team’s productivity.
Rather than allowing anyone to merge to master, you can set up a merge bot. A merge bot enforces an ordered queue of merges. Some bots work by looking out for a particular tag on branches and then adding those to the merge queue. So, rather than merge your branch with master you might tag it “FOR-MERGING” or similar.
The merge bot picks each proposed merge in turn and checks that it is up to date with the current state of master. If it is, then it runs the test suite and merges if all is good. If it’s not up to date with master, you as the developer will need to merge master back into your branch before resubmitting to the merge queue.
Using such a queue means that master is always green and that two merges that conflict with each other can’t both land.
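To make the queue concrete, here is a minimal sketch of the logic in Python. It is a toy model rather than any real bot's implementation: the `Branch` and `MergeBot` names are hypothetical, the `base` field stands in for "which revision of master this branch was last synced with", and `tests_pass` stands in for actually running the suite.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class Branch:
    name: str
    base: int                 # master revision the branch was last synced with
    tests_pass: bool = True   # stand-in for running the real test suite


class MergeBot:
    """Toy merge queue: one merge at a time, in submission order."""

    def __init__(self) -> None:
        self.master_rev = 0
        self.queue: Deque[Branch] = deque()
        self.merged: List[str] = []

    def submit(self, branch: Branch) -> None:
        self.queue.append(branch)

    def process_next(self) -> str:
        branch = self.queue.popleft()
        # Reject anything that has fallen behind master; the developer
        # must merge master back in and resubmit.
        if branch.base != self.master_rev:
            return "stale"
        # Only up-to-date branches get a test run.
        if not branch.tests_pass:
            return "tests failed"
        self.master_rev += 1
        self.merged.append(branch.name)
        return "merged"
```

If two branches are both cut from revision 0 and submitted together, the first one merges and the second comes back as `"stale"`, which is exactly the situation that would otherwise have broken master.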
Long-running test suites are productivity killers. If it takes 40 minutes to run your tests, then that’s an additional 40 minutes of dead time on each occasion that you come to merge. A test fails? Well, that’ll be another 40 minutes after you fix the issue.
But it’s not just about productivity. It’s a morale issue, too. You don’t want to dread getting ready for a pull request. Sure, there are other things you can get on with while the suite runs, but context switching has a mental cost.
There’s a third issue, too. Long-running test suites lead to merge rot. Let’s say you have a bot enforcing a merge queue. If your test suite takes an hour and there are five merges ahead of yours in the queue, that’s five hours’ worth of changes that could make your branch unmergeable. Once the bot tells you your branch isn’t getting in, you get your branch back into a mergeable state and then sit through yet another test run.
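The arithmetic is simple but worth seeing written down. This sketch assumes the bot works through the queue one merge at a time and runs the full suite for each; the function names are made up for illustration.

```python
def queue_delay_minutes(suite_minutes: int, merges_ahead: int) -> int:
    """Minutes that pass before the bot gets to your branch."""
    return suite_minutes * merges_ahead


def minutes_until_merged(suite_minutes: int, merges_ahead: int) -> int:
    """Queue delay plus the test run for your own branch."""
    return queue_delay_minutes(suite_minutes, merges_ahead) + suite_minutes


# The example from the text: a one-hour suite with five merges ahead of yours.
delay_hours = queue_delay_minutes(60, 5) / 60
total_hours = minutes_until_merged(60, 5) / 60
```

Halve the suite time and both numbers halve with it, which is why run time matters even more once a queue is in place.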
There are several ways to reduce your test suite run times: run tests in parallel for non-dependent areas of the code, cache external dependencies, or increase the resources available to your CI server. But one of the most effective is to set a recurring calendar entry to check for and delete old tests, such as those covering features that no longer exist or duplicating what other tests already check. Fewer tests, shorter test runs.
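As one illustration of the parallel option, here is a rough sketch of splitting independent tests across workers so that each worker gets a similar amount of work. It uses a simple greedy heuristic (hand the longest remaining test to the least loaded worker), which is an assumption on my part rather than how any particular CI service does it. The test names and timings are invented.

```python
import heapq
from typing import Dict, List


def shard_tests(durations: Dict[str, float], workers: int) -> List[List[str]]:
    """Assign tests to workers, longest first, always to the least loaded worker."""
    shards: List[List[str]] = [[] for _ in range(workers)]
    # Heap of (total seconds assigned so far, worker index).
    loads = [(0.0, i) for i in range(workers)]
    heapq.heapify(loads)
    for name, seconds in sorted(durations.items(), key=lambda kv: kv[1], reverse=True):
        load, idx = heapq.heappop(loads)
        shards[idx].append(name)
        heapq.heappush(loads, (load + seconds, idx))
    return shards


# Hypothetical timings in seconds, e.g. taken from a previous run.
timings = {"test_api": 30, "test_db": 20, "test_ui": 10, "test_auth": 10, "test_cli": 10}
shards = shard_tests(timings, workers=2)
shard_loads = [sum(timings[t] for t in shard) for shard in shards]
```

With these numbers the 80 seconds of tests split into two shards of 40 seconds each. This only works when the tests in different shards really are independent of each other.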
Caching is one way of reducing test suite run times but it’s worth calling out by itself. Like any cache, you’re making a tradeoff. In this case it’s between the speed of your test suite and the freshness of any dependencies.
Let’s say your test suite builds several Docker containers, pulls thirty npm packages, and relies on data from ten external APIs. Preparing that for each test suite run would be a waste of bandwidth and, more importantly, extend the run time significantly.
However, what if one of those npm packages makes a breaking change between the cache point and when the suite runs? Presumably your own due diligence would mean you’re aware of planned changes, but then let’s not forget the left-pad incident.
It’s probably good enough to clear your caches daily and make sure that the first test run of the day pulls in everything anew. However, there’s a decision to make over which external resources shouldn’t be cached at all because they’re more likely to change frequently or any changes would have a significant impact on your code.
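One common way to get a daily refresh is to put the date into the cache key, so the first run of each day misses the cache and pulls everything anew. The sketch below also mixes in a hash of the dependency lockfile so the cache is dropped when dependencies change; that part is my own addition, not something the daily rule requires. The function name and key format are hypothetical.

```python
import hashlib
from datetime import date


def daily_cache_key(lockfile_contents: bytes, today: date) -> str:
    """Build a cache key that changes every day and whenever the lockfile changes."""
    digest = hashlib.sha256(lockfile_contents).hexdigest()[:16]
    return f"deps-{today.isoformat()}-{digest}"


# Same lockfile, consecutive days: the second day gets a fresh cache.
lock = b'{"left-pad": "1.3.0"}'
monday = daily_cache_key(lock, date(2024, 1, 1))
tuesday = daily_cache_key(lock, date(2024, 1, 2))
```

Resources you decide should never be cached simply stay outside whatever this key covers and are fetched on every run.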
Continuous integration has changed how we build software. It has, for the most part, improved reliability and reduced the chances of embarrassing mistakes making it into production.
However, CI is not a fire-and-forget tool. It needs regular care and attention in order to make the greatest contribution to your code quality. Today’s new ideas will be tomorrow’s basic requirements. So, if you want to get the most from CI, make refining your approach a regular part of your team’s activities.