Tests are an essential part of software development as they give a ground truth about code sanity. As developers, we reasonably expect our unit tests to give the same results if the source code does not change. However, it can happen that . Such a test is named a . the result of a unit test changes over multiple executions of a test suite without any change in the code flaky test A flaky test is , diminishing the benefits of the latter. It is thus recommended to eradicate such issue as soon as possible. not dangerous per se but reduces the confidence a developer can give to his test suite However, depending on the origin of the flakiness, one may find out only a few days, months, or even years later that the tests are flaky. It may be hard to dive back into those and find the root causes, . so we usually tend to put those tests aside to make them less annoying, or we rerun them until success As real-world examples of flaky tests and the logic behind their resolution, I will discuss two interesting cases I had the opportunity to fix during my career. Storytime! During my career, I stumbled onto a couple of flaky tests issues. There are two instances that, in my opinion, are quite symptomatic of test flakiness, with quite different contexts. The examples are voluntarily adapted to a simpler context than the original ones to keep a short and focused article. Story 1: 5 days a month isn’t a big deal I entered a project where continuous integration was broken during the 5 first days of each month. I was told that this wasn’t a big deal since we don’t need to deploy this project at the beginning of a month. The priority of fixing those tests was so low that it had remained like this for years before we took the time to tackle this issue. month = arrow.get().floor( ) data = retrieve_data(month) len(data) now = arrow.get() add_data(now.shift(days= )) add_data(now.shift(days= )) stat = compute_stats() stat == , : def add_data (input_datetime, data) # Store the data along with the input date # [...] : def retrieve_data (lower_date) # Retrieve all data from lower date to now # [...] : def compute_stats () # Compute some statistics about stored data 'month' return : def test_compute_stats () # Test method checking the behaviour of compute_stats -5 -1 assert 2 "We retrieve the two data input" The faulty feature was computing statistics about the current month. As the developer creating the initial tests wanted to take all cases into consideration, he created a test that gave as input multiple dates relative to the current . datetime Among those inputs, one was and the test was always computing the statistics as if it was part of the same month. As a result, it led to the tests being faulty at the beginning of each month. We can then imagine that the flakiness was detected under one to three weeks after the feature development and, from then on, ignored. 5 days before the current date, This test is time-dependent because will call for the date of analysis itself. As a result, the computation will always be dependent on the current date. One way of fixing such issues would be to sandbox the execution of the tests to control the current date. compute_stats At first, we wanted to rely on and make ask for a month to compute the statistics. This would create an easy way of sandboxing the execution and also potentially open the door to new features. However, in this project, this wasn’t trivial to implement because there was a lot of code dependent on this feature. dependency injection compute_stats Another way of doing so would be to inject the value directly into the method. Python has a very good library to sandbox the tests when using the built-in objects: . Once again, and unfortunately for us, the project was using so this was not a possibility. datetime freezegun arrow Fortunately, and thanks to some previously well-thought environment on the project, we had a central method to provide the current date, initially intended to prevent the use of a wrong timezone. By mixing this method with the awesome of the python mock library (which is part of the standard library since 3.3), we solved the issue with a simple modification. patch decorator unittest month = arrow.get().floor( ) data = retrieve_data(month) len(data) now = arrow.get() add_data(now.shift(days= )) add_data(now.shift(days= )) stat = compute_stats() stat == , : def add_data (input_datetime, data) # Store the data along with the input date # [...] : def retrieve_data (lower_date) # Retrieve all data from lower date to now # [...] : def compute_stats () # Compute some statistics about stored data 'month' return : def test_compute_stats () # Test method checking the behaviour of compute_stats -5 -1 assert 2 "We retrieve the two data input" By sandboxing the execution to a given point in time, we ensured the reproducibility of the build at any given time. Story 2: We use that configuration! In another project, while creating a new feature, we happened to break tests unrelated to our changes. This case could have been tedious to pinpoint. Fortunately, due to the project organization, we were certain that the new feature did not affect the code covered by the now failing tests. The code below is a synthetic representation of what happened. A global object was interacting with both the existing and new features. config config = {} config: ValueError( ) config: ValueError( ) # Global state configuration : def existing_feature () if "common_entry" not in raise "Not configured" # Process [...] return True : def our_new_feature () if "common_entry" not in raise "Not configured" # Process [...] return True From the isolation of the two features, we knew that the new tests had to create a faulty global state. There were globally two possibilities for the faulty state. Either the new test was injecting something new to the global state, or removing something essential to the existing test. The test cases below were always run in the specific order before integrating the new feature, then in the order ConfiguredFeatureTest , ExistingFeatureTestCase ConfiguredFeatureTest, NewFeatureTestCase , ExistingFeatureTestCase . = def test_configured(self): self.assertIsNotNone(config.get( )) = def tearDown(self): config.clear() def test_new_feature(self): self.assertTrue(our_new_feature()) ( . ): ( ): [" "] class ConfiguredFeatureTest unittest TestCase def setUp self config entry "anything" "entry" ( . ): ( ): . ( ()) ( . ): ( ): [" "] class ExistingFeatureTestCase unittest TestCase def test_feature_one self self assertTrue existing_feature class NewFeatureTestCase unittest TestCase def setUp self config entry "anything" To understand the behavior of the existing test, we ran it alone, both with and without the new changes. It appeared that the test was failing in both cases. This gave us the information that this test was using an existing global state and that we might be cleaning this state. So we took a deeper interest in the method. tearDown It happened that the global configuration was injected and cleared in our new test suite. This configuration was used but rarely cleared in other tests. As a result, the existing test was relying on the execution of the previous ones to succeed. Clearing the configuration removed the context required by the existing test, thus made it fail. By , the tests were always run in the right order for years. This situation could have been detected way earlier by using a random execution order for tests. . chance It happens that python has simple modules to do so To fix the tests and avoid this situation in the future, we decided to force the configuration clearing in the project’s test suite superclass. This meant to fix a bunch of other tests failing after this but also enforced a clean state for all new tests. Bonus story: Good flakiness On top of the previous stories, where flakiness is obviously a bad thing, . I also stumbled into a case where I found flakiness somehow beneficial to the codebase In this particular case, the test intended to assert some data consistency for any instances of the same class. To do so, the test was generating numerous instances of the class with randomized inputs. This test happened to fail during some executions as the inputs were creating a behavior not accounted for by the feature. That enabled to extract a specific test case for the input and fix the behavior in this case. While I agree that edge cases should be analyzed during the development process, sometimes the input scope is too wide to consider all of them, let alone test all possibilities. In those cases, randomizing the input of a method that should keep a consistent output is a good way to assert the codebase sanity in the long run. Conclusion This article showed some real-life examples of flaky tests. They were not the worst to track down, but they pinpoint the fact that flakiness can resurface at any time, even years after their introduction! Once they appear, the flaky tests need to be fixed as soon as possible out of fear that the failing test suite will be considered as a normal state. The developers may then not rely on the tests suites anymore. Flakiness is mostly due to non-deterministic behavior in the code. In this article, we had an example of: If the code is dependent on time, there may be failing tests on specific dates. Specific execution time . Using random values in the main code or in the tests needs extra care, or the behavior may vary depending on those random values. Randomness. Using a global state in a project can create inconsistencies in tests if the state is not managed correctly. Modified global state. Some behavior can help to limit the amount of flakiness that hides in the tests, such as: to keep reproducible execution. Control your test execution environment to minimize the side effects of environmental settings. Avoid global states to determine dependencies between tests. Randomize the test execution order Continue on the subject: https://hackernoon.com/flaky-tests-a-war-that-never-ends-9aa32fdef359 https://martinfowler.com/articles/nonDeterminism.html https://docs.pytest.org/en/stable/flaky.html If you are curious about the context that led to the apparition of those flaky tests, my former manager provides a more detailed view in a very interesting post: . Kevin Deldycke Billing Pipeline: A Critical Time Sensitive System Photo Credit: Red light: https://gph.is/2qcADfF Fingers Crossed: https://giphy.com/gifs/xUOwGkoxib4Y9A9T9u Also published here .