In this post, we will discuss a common pain point in developing an end-to-end (E2E) test suite: flakiness. Flaky tests are tests that fail even though they should pass. Because of the complexity of E2E tests, it's unlikely they will ever be as stable as unit tests—there will always be some flakiness, but our job is to make sure it does not make the tests useless.
E2E instability creates issues on many levels: it diminishes the benefits that the tests could bring to the project, and at the same time it increases the cost of maintaining the suite. Let's look at the different ways this instability hurts a project.
First and foremost, it's simply annoying for team members to constantly deal with random E2E failures. Depending on your continuous integration (CI) setup, random failures can require manually restarting tests or block the steps that follow in the pipeline; in every case, they make everything slower. Too much of this annoyance will make it even more difficult to convince your developer colleagues to write and maintain E2Es.
Besides the annoyance, random noise in E2Es can make it difficult to catch issues that appear in a nondeterministic manner. Let's say some feature in your app fails once every 20 attempts: in theory, E2E is the perfect tool to catch this kind of issue. But the automated tests will not help if you and your team have a habit of rerunning tests until they pass. The random noise has to be minimal for you to notice a subtle signal that appears randomly.
It’s difficult to trust test results if you have a habit of rerunning tests every time they flag an error. Flaky tests make everybody question their results, and they train the team to treat the failing E2E CI job as a nuisance. This is the opposite of what you need to get the benefits of adding automated tests to your project.
Before we go further, let’s revisit the math basics of E2E tests. Similar to medical tests, you can understand the E2E suite as a test that catches bugs:
Flaky tests are cases of false positives: some tests are failing, even though there is no regression.
To analyze flakiness, you need concepts and terminology from the theory of probability. We can express flakiness as a probability: the ratio of false positive results to the total number of test runs. By keeping the branch stable and rerunning tests many times, we can estimate this probability for:
- the test suite as a whole and
- each test separately.
To keep the theory to the minimum, we can ignore the reverse problem—tests passing randomly even though they should fail. This usually means that our test coverage is inadequate, and the problem can be solved by adding some new test cases.
Usually, we consider a test suite to fail when even just one test fails. This leads to an unintuitive impact of individual test stability on the whole suite. Let's assume our tests are flaky at 1 random failure per 6 test runs—so we can model a random failure as rolling a 1 on a die.
When we have only one test in the suite, the calculation is simple:

- ⅚ chance of a true negative result (the test passes), and
- ⅙ chance of a false positive result (the test fails even though there is no regression).
When we have two tests, we can model our problem as rolling two dice: if either die shows a 1, that test fails, and therefore our suite fails. For everything to work as expected, we need the first test to pass, which has a ⅚ chance, and the second test to pass as well—again with ⅚ probability. Assuming the two tests are independent, we can multiply the probabilities to find the combined probability of both tests passing. So, the final results are as follows:
- 25/36, or about a 0.69 chance of a true negative result
- 1 - 25/36 = 11/36, or about a 0.31 chance of a false positive result
As you can see, adding a new test made the false positive results significantly more probable.
The general formula for the suite's false positive rate is the following:

Psuite = 1 - (1 - Pt)^N

Where:

- Psuite is the probability of the whole suite failing with a false positive,
- Pt is the probability of a single test failing randomly, and
- N is the number of tests in the suite.
With this formula in mind, we can see that flakiness of individual tests has an enormous impact on the stability of the whole test suite:
| N ⟍ Pt | 1/6 | 1/10 | 1/100 | 1/1000 | 1/10000 |
|---|---|---|---|---|---|
| 5 | 0.59812 | 0.10331 | 0.01866 | 0.00340 | 0.00062 |
| 25 | 0.98952 | 0.42030 | 0.08988 | 0.01687 | 0.00309 |
| 100 | 1.00000 | 0.93454 | 0.37557 | 0.08154 | 0.01536 |
| 500 | 1.00000 | 1.00000 | 0.90507 | 0.34641 | 0.07448 |
The rows show different numbers of tests in the suite (N), and the columns show the probability of an individual test failing randomly (Pt). Each cell shows the probability of the suite failing with a random error for that combination. As you can see, when you add more tests, their instability accumulates very quickly.
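If you want to experiment with other combinations of N and Pt, the formula is easy to evaluate yourself. A minimal TypeScript sketch:

```typescript
// Probability that a suite of n independent tests fails at least once,
// given that each individual test fails randomly with probability pt.
function suiteFailureProbability(n: number, pt: number): number {
  return 1 - Math.pow(1 - pt, n);
}

// Print the suite failure probability for a few combinations of N and Pt.
const testCounts = [5, 25, 100, 500];
const flakiness = [1 / 6, 1 / 10, 1 / 100, 1 / 1000, 1 / 10000];

for (const n of testCounts) {
  const row = flakiness
    .map((pt) => suiteFailureProbability(n, pt).toFixed(5))
    .join("  ");
  console.log(`N=${n}: ${row}`);
}
```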
Now that we know how the number of tests and their stability interact in the suite, let's go through possible causes of random test failures.
Depending on your system architecture, it could be difficult to isolate your tests perfectly—especially in places where you are connecting to external systems. The application on which I work has the following backends:
Each non-isolated server that is used by your E2E can cause issues:
To provide the necessary isolation from those external systems, you have a few options:
- move more of the infrastructure to run specifically for each job that runs E2E—something that can be easily done with Docker
- implement dummy proxies on your backend—and make sure the dummy implementations are only used in tests
- mock backend requests with your E2E framework.
Option 1 allows you to truly cover both the backend and frontend with tests. Options 2 and 3 allow testing the frontend without checking the backend—a departure from the idea of E2E testing, but it is sometimes necessary.
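As a rough illustration of option 3, here is a minimal Cypress sketch that stubs a backend request so an external service cannot introduce random failures. The /api/orders endpoint, the response shape, and the page content are made up for the example:

```typescript
// orders.cy.ts — a sketch of option 3: stubbing a backend the E2E
// environment cannot isolate. Endpoint and response shape are hypothetical.
describe("orders list", () => {
  it("shows orders returned by the stubbed backend", () => {
    // Intercept the request and answer it ourselves, so the external
    // service is never hit and cannot cause random failures.
    cy.intercept("GET", "/api/orders", {
      statusCode: 200,
      body: [
        { id: 1, status: "shipped" },
        { id: 2, status: "pending" },
      ],
    }).as("getOrders");

    cy.visit("/orders");
    cy.wait("@getOrders");

    cy.contains("shipped").should("be.visible");
  });
});
```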
Sharing data across tests can lead to unexpected failures, especially when you combine two things:
- tests running in parallel or in random order
- data left behind after tests
Usually when I create a test, I try to clean up the data I create. So, if I want to test the create and remove functionality of my app, I combine them into one test. For other operations, I try to revert them within the test as well.
At some point, I was running multiple instances of a test runner against the same backend and database. Any data sharing across those tests was causing random failures in some tests. To address this problem, I moved my tests to depend mostly on the data I create on the fly, just before the test is run. The migration was pretty time-consuming, but it allowed for running tests in parallel while keeping them all in one CI job.
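Here is what "data created on the fly" can look like in practice: a Cypress sketch in which each test creates its own uniquely named record through the API before running and removes it afterward. The /api/projects endpoint, the payload, and the selectors are hypothetical:

```typescript
// A sketch: each test creates its own uniquely named record via the API,
// so parallel runners never compete for the same data.
describe("project settings", () => {
  let projectName: string;

  beforeEach(() => {
    // A unique name per run avoids collisions between parallel jobs.
    projectName = `e2e-project-${Date.now()}-${Math.floor(Math.random() * 1e6)}`;

    // Hypothetical endpoint — adjust to whatever your backend exposes.
    cy.request("POST", "/api/projects", { name: projectName });
  });

  it("renames the project", () => {
    cy.visit("/projects");
    cy.contains(projectName).click();
    cy.get('[data-cy="rename-input"]').clear().type(`${projectName}-renamed`);
    cy.get('[data-cy="save"]').click();
    cy.contains(`${projectName}-renamed`).should("be.visible");
  });

  afterEach(() => {
    // Clean up so leftover data does not affect other tests.
    // Hypothetical endpoint; deletion by name is just for the sketch.
    cy.request("DELETE", `/api/projects/${projectName}`);
  });
});
```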
A few years ago, E2E tests were difficult to write because the tools did not track the state of the app very well—so you needed to manually program waits to make sure that the test runner did not try to interact with the app while the data was still loading. Modern tools, such as Cypress, are much better at waiting for the application to load data. But even now, I sometimes struggle with random issues created by the tests. Some examples are as follows.
Most importantly, flakiness is sometimes caused by the application itself failing randomly. This kind of issue is pretty annoying for users and developers alike. Even with automated tests, you need to repeat the same test over and over again to have a chance of seeing how the bug occurs.
Issues like this can be perplexing for users because we normally expect the same actions to lead to the same results. This confusion will appear in the bug reports as well—not a great start for troubleshooting.
Being serious about troubleshooting E2E instability helps find and address those issues before they affect customers. The upside of expending all this effort is that we can avoid creating the impression that our application is unreliable.
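Coming back to the timing issues mentioned above: when the framework's automatic waiting is not enough, waiting for a specific network request is usually more reliable than a fixed delay. A minimal Cypress sketch, with a made-up /api/report endpoint and selector:

```typescript
// A sketch of waiting for a specific request instead of sleeping.
// The /api/report endpoint and the selector are made up for illustration.
it("shows the report once it has loaded", () => {
  // Register the intercept before triggering the request.
  cy.intercept("GET", "/api/report").as("getReport");

  cy.visit("/report");

  // Instead of cy.wait(5000) and hoping 5 seconds is enough,
  // wait until the request the page depends on has completed.
  cy.wait("@getReport");

  cy.get('[data-cy="report-table"]').should("be.visible");
});
```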
We have two other options to improve the E2E stability of our projects.
As we have seen in the table above, even tests that fail only once every thousand runs can become pretty unstable when we have 500 of them. Luckily, in my experience, the instability is never distributed so uniformly. It's usually a handful of unstable tests that cause the suite to fail. This means that you can focus on troubleshooting the tests you see failing most often and improve the overall stability enough that random failures stop causing too many problems.
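Finding those few tests is mostly a matter of counting. As a rough sketch, assuming you can export the titles of failed tests from past CI runs in some form (the input format below is made up), you can rank them by failure count:

```typescript
// A sketch: given the failed-test titles collected from past CI runs
// (however you export them), rank tests by how often they fail.
interface RunResult {
  runId: string;
  failedTests: string[]; // titles of tests that failed in this run
}

function rankFlakyTests(runs: RunResult[]): Array<[string, number]> {
  const failures = new Map<string, number>();
  for (const run of runs) {
    for (const test of run.failedTests) {
      failures.set(test, (failures.get(test) ?? 0) + 1);
    }
  }
  // Most frequently failing tests first — the best candidates to debug.
  return Array.from(failures.entries()).sort((a, b) => b[1] - a[1]);
}

// Example usage with made-up data:
const report = rankFlakyTests([
  { runId: "101", failedTests: ["orders list shows orders"] },
  { runId: "102", failedTests: [] },
  { runId: "103", failedTests: ["orders list shows orders", "login redirects"] },
]);
console.log(report); // [["orders list shows orders", 2], ["login redirects", 1]]
```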
Recently, I migrated my project from running all E2E tests in one job to running separate jobs for E2E tests related to different parts of the application. This change brought a few improvements:
Besides the solutions I use and recommend, there are a few approaches that feel much more like 'hacks' to me.
I have always opposed rerunning tests automatically. My main issue is that it makes it effortless for developers to ignore anything that happens in the tests in a nondeterministic manner. In this way, it invites leaving unresolved both annoying E2E problems and actual code issues that can affect users.
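For reference, this is roughly what automatic retrying looks like in Cypress: a minimal cypress.config.ts sketch of the setting I prefer to leave disabled.

```typescript
// cypress.config.ts — a sketch of the automatic-retry setting discussed above.
// I prefer to keep retries off so nondeterministic failures stay visible.
import { defineConfig } from "cypress";

export default defineConfig({
  e2e: {
    retries: {
      runMode: 2, // retry failed tests up to 2 times in `cypress run` (CI)
      openMode: 0, // no retries in interactive mode
    },
  },
});
```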
When I develop code, I manually choose which E2E tests I want to run—the ones that have a chance of being affected by my changes. As your test suite grows, execution time increases, and more tests exacerbate stability issues. It can be tempting to get smart on the CI side as well: you could look for ways to automatically determine which tests can be affected by a change and run only those.
I see the following issues here:
If you consider a test suite to fail when even one test fails, is it necessary to continue running tests after the first failure? Failing faster would allow you to rerun tests earlier and save some CI resources. That being said, I still avoid stopping the E2E run after the first test failure for the following reasons: