Don’t you hate it when things are not deterministic? A test should consistently pass or consistently fail as long as no code changes are applied. We should run our tests against a controlled environment and make assertions against an expected output. We may use a test fixture as a baseline for running tests. A test fixture is a fixed state, so the results should be repeatable. A flaky test is a test that can either pass or fail for the same configuration. Such behavior is harmful to developers because test failures no longer reliably indicate bugs in the code. Our test suite should act like a bug detector. Non-determinism can plague any kind of test, but it is particularly likely to affect tests with a broad scope, such as acceptance and functional/UI tests.
A good suite of tests should let you decide whether the code is ready to be released. When I have a test suite that I can trust, a successful test run gives me the green light to proceed with a release. It also gives me confidence that I can refactor the code safely. In TDD, we should run all our tests after every code change. This is not always possible, but we have to run the whole suite at least every now and then, and we have to ensure that all our tests run successfully after committing our changes. A test that consistently fails is not a flaky test, and the two must not be confused.
But how could you introduce a flaky test? Let’s see some common reasons a test could be flaky:
Continuous Integration is the practice of merging all developers’ working copies into a shared mainline several times a day. A flaky test can block or delay development until it is spotted and resolved. The problem is that you do not know whether you caused the test failure or whether the test is flaky. There is no easy way to deal with flaky tests, but some practices can help you spot them and deal with them.
As a very first step, re-run all failed tests with a clean system state. This is an easy way to tell whether the failed tests fail consistently or are flaky. A successful re-run does not mean that you can ignore the test, though. It only confirms that the test is indeed flaky and that you have to deal with it. Some tools support automatically re-running failed tests in a development or CI environment, which can help you get through.
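The re-run step can be sketched in a few lines of Python. This is only an illustration under assumptions: `classify_failure`, `reset_state` and the `reruns` default are hypothetical names and values, not part of any particular test runner.

```python
from typing import Callable


def classify_failure(
    test: Callable[[], None],
    reset_state: Callable[[], None],
    reruns: int = 3,
) -> str:
    """Re-run an already failed test from a clean state and classify it.

    Returns "flaky" if any re-run passes, "consistent failure" otherwise.
    """
    for _ in range(reruns):
        reset_state()  # hypothetical hook that restores a clean system state
        try:
            test()
        except Exception:
            continue  # still failing, try again
        return "flaky"  # it passed without any code change
    return "consistent failure"
```

A "flaky" result is a signal to record and investigate the test, not a reason to treat the build as green.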
Place any spotted flaky test in a quarantined area. Teams should follow a strict process when they spot a flaky test. After you have recorded the failure, place the test in the quarantined area. This lets others know that the test is possibly flaky and will be investigated. The main reason, though, is that all the other, healthy tests remain trustworthy. This does not mean that you can postpone the investigation. Someone has to pick it up soon. You can enforce this by limiting either the number of quarantined tests or the time a test may spend in quarantine.
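Both limits can be enforced mechanically. The sketch below is one possible shape, assuming a hypothetical in-memory `Quarantine` registry; the class name, the default limits and the idea of passing `today` explicitly are illustrative choices, not a prescribed process.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class Quarantine:
    """Hypothetical registry of quarantined tests with two limits."""

    max_items: int = 5                       # assumed limit on quarantine size
    max_age: timedelta = timedelta(days=14)  # assumed limit on time in quarantine
    entries: dict[str, date] = field(default_factory=dict)

    def add(self, test_name: str, today: date) -> None:
        """Quarantine a test, refusing once the size limit is reached."""
        if test_name in self.entries:
            return
        if len(self.entries) >= self.max_items:
            raise RuntimeError("quarantine is full; fix a flaky test first")
        self.entries[test_name] = today

    def overdue(self, today: date) -> list[str]:
        """Return the tests that have stayed in quarantine for too long."""
        return sorted(
            name
            for name, since in self.entries.items()
            if today - since > self.max_age
        )
```

A CI job could fail the build whenever `overdue` returns anything, which keeps the quarantine from turning into a permanent home.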
Running tests frequently in scheduled builds at different times of day can reveal flaky tests. It is better to spot a flaky test early than to have it emerge during a release.
In order to deal with flaky tests, you should record every test that is flaky. Upon a failure, gather all the related data that can help you investigate later what went wrong: logs, memory dumps, the current system state, or even screenshots for UI tests. A ticketing system works fine for storing all that data, and it also tells you how many flaky tests there are. Create a new ticket for each flaky test so that someone will pick it up.
When you have identified that a test is flaky and the test has lived in your codebase for a long time, try to figure out when the flakiness was introduced. For example, if the test has failed in your CI pipeline before, you can try to find out which code changes could have affected its behavior.
Tests that make assertions on dynamic content have to wait for the content to load. Putting a test to sleep for a fixed amount of time is not a good practice. UI tests are slow enough, and you don’t want to make them even slower. You could use callbacks if the dynamic content provider offers them. If there are no callbacks, you can poll at short wait intervals. The wait interval is the minimum time that you have to wait when the content is not available, so it should be short. It should also be easily configurable. The test environment can change, so the wait interval will need tweaking over time.
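A minimal polling helper might look like the following. It is a sketch, assuming a hypothetical `wait_for` function whose `timeout` and `interval` defaults are placeholders that you would tune for your own environment.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_for(
    condition: Callable[[], Optional[T]],
    timeout: float = 5.0,    # assumed upper bound, in seconds
    interval: float = 0.05,  # assumed short, configurable wait interval
) -> T:
    """Poll `condition` until it returns a truthy value or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # content is available, stop waiting immediately
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

Unlike a fixed sleep, the helper returns as soon as the content shows up, so a fast environment pays almost nothing and a slow one only needs a larger `timeout`.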
Tests that usually pass but occasionally fail are hard to reproduce. This is where the data gathered at failure time, as mentioned earlier, can help. Once we spot such a test, we have what we need to reproduce the faulty scenario. Another way to investigate is to run the test repeatedly until it fails, and then do a post-mortem analysis to identify the root cause. Unfortunately, this procedure does not always succeed, but it costs nothing to leave it running while you investigate other possible causes.
The best way to deal with time bombs is to wrap the system clock in routines that can be replaced with a seeded value for testing. You can use this clock stub to travel to a particular time and freeze the clock there, giving your tests complete control over its movements. That way you can synchronize your test data with the values in the seeded clock.
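One way to wrap the clock is to pass it into the code as a dependency. The sketch below assumes hypothetical `SystemClock` and `FrozenClock` classes and an illustrative `is_expired` function standing in for production code that depends on the current time.

```python
from datetime import datetime, timedelta, timezone


class SystemClock:
    """Production clock: delegates to the real system time."""

    def now(self) -> datetime:
        return datetime.now(timezone.utc)


class FrozenClock:
    """Test clock: starts at a seeded time and only moves when told to."""

    def __init__(self, start: datetime) -> None:
        self._now = start

    def now(self) -> datetime:
        return self._now

    def advance(self, delta: timedelta) -> None:
        self._now += delta


def is_expired(expires_at: datetime, clock) -> bool:
    """Illustrative production code that asks the injected clock for the time."""
    return clock.now() >= expires_at
```

In a test, you would seed `FrozenClock` with a fixed date, derive the test data from that same date, and call `advance` to step across the boundary you care about.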
As said, a carelessly written test that does not clear its state after execution can cost you a lot of time while you try to figure out why other tests are failing. Those tests might in turn assume that the system is in a vanilla state, which is also wrong. One way to deal with this kind of flakiness is to rerun all your tests in the same order in which they failed. A test might pass when run on its own and fail under a specific execution order. In general, you should configure your tests to run in random order, to identify tests that are affected by other badly written tests. Most testing libraries provide a way to execute tests in random order. Use this option, as it will force you to write more resilient and stable tests.
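Random order and reproducible order fit together if the shuffle is driven by a seed that gets logged. The following is a sketch of the idea only; `run_in_random_order` is a hypothetical helper, and real test runners expose their own options for this.

```python
import random
from typing import Callable, Optional, Sequence


def run_in_random_order(
    tests: Sequence[Callable[[], None]],
    seed: Optional[int] = None,
) -> int:
    """Run tests in a shuffled order and return the seed that produced it."""
    if seed is None:
        seed = random.randrange(2**32)  # pick a fresh order for this run
    order = list(tests)
    random.Random(seed).shuffle(order)  # same seed gives the same order
    print(f"test order seed: {seed}")   # log it so a failure can be replayed
    for test in order:
        test()
    return seed
```

When a run fails, passing the logged seed back in replays exactly the same order, which is what you need to find the test that left state behind.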
When you have a big suite of tests, it is hard to avoid flaky tests, especially among UI and integration tests. Usually, new flaky tests are introduced at about the same rate as existing ones are dealt with. There should be a level of awareness in the team about flaky tests, and guarding the tests should be part of the team culture. After all, it’s the team’s productivity that gets affected. When you get used to seeing your pipeline red, you inevitably pay less attention to other problems as well. One recurring problematic test becomes unreliable, so unreliable that you ignore whether it passes or fails. To make things worse, others will also look at the red pipeline and notice that the failures are in non-deterministic tests, and soon they’ll lose the discipline to take any action. Once that discipline is lost, a failure in the healthy, deterministic tests will get ignored too. A red pipeline should be treated as an alert. It is like a traffic light. Red means we should not continue development!
As a rule of thumb, if you face a flaky test, do not assume that it is a test problem. You should suspect the production code first and then the test. Sometimes a flaky test is flawless and has just revealed a bug in your code. Just remember, a bug’s best place to hide is a flaky test, because developers will assume that something is wrong with the test and not with the code.