A/B testing is a data-driven approach to choosing between different versions of a product. The versions are presented to users at random, and their behavior is logged and carefully stored. This data is then analyzed to determine which version delivers the best results, e.g. higher conversion or better user engagement. Many things can be subject to A/B testing: a new design, a user interface, or even an entirely new user flow for a website, application, or service.
The idea sounds simple; however, it often turns out to be tricky to get right. Drawing some conclusions from data is easy, but making the right decisions is the difficult part. Tackling potential issues starts long before the actual A/B test, because asking the right question in the right manner is a necessary prerequisite for getting the right answer. In this article, we will look at the most common traps in A/B testing and ways to avoid them.
Remember that seeking, identifying, and explaining patterns is an essential part of human nature. Diving into A/B test results without a clear idea of what exactly we are looking for can and will lead to misinterpretation. We spend effort setting up an A/B test because we believe in the change we are making, so we tend to explain the results in our favor. Coming up with an explanation only after seeing the data is not much better than not doing any A/B testing at all. Any A/B test should therefore start with a hypothesis, best formulated in an “if/then” manner.
For instance: “if we make the ‘add to cart’ button red instead of blue, then the conversion rate will increase”. A clear hypothesis helps set clear goals for the test, choose the right test design, and, very importantly, interpret the results correctly. It is also essential when communicating the test goal and results to teammates and stakeholders, making sure everyone has the same expectations.
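As a minimal sketch of how such a hypothesis could be evaluated once the experiment has run its planned course (assuming a Python/scipy stack; the conversion counts below are made up for illustration):

```python
# Hypothetical counts collected after the test's planned runtime.
from scipy.stats import chi2_contingency

# rows: group A (blue button), group B (red button)
# cols: converted, did not convert
table = [[530, 9_470],
         [601, 9_449]]

chi2, p_value, _, _ = chi2_contingency(table)
rate_a = 530 / 10_000
rate_b = 601 / 10_050
print(f"conversion A = {rate_a:.2%}, conversion B = {rate_b:.2%}, p = {p_value:.4f}")

if p_value < 0.05 and rate_b > rate_a:
    print("Evidence supports the hypothesis: the red button increases conversion.")
else:
    print("The hypothesis is not supported; keep the blue button.")
```

The point is that the decision rule (significance threshold, direction of the expected effect) follows directly from the if/then hypothesis and is fixed before looking at the data.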
Setting clear test rules and sticking to them is just as important as putting forward a hypothesis. Runtime is one of the most fundamental parts of these rules. Failing to fix the runtime in advance and finish the test exactly when planned inevitably leads to misinterpretation. Imagine we start an experiment, peek at the data daily, and halt the test as soon as we see favorable figures. Such a result will most probably lead to an incorrect decision. The reason is the same as with an unclear hypothesis: human ambition drives us to interpret results in our favor.
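The damage done by peeking can be seen in a rough simulation (not from the article; it assumes Python with numpy and scipy, and made-up traffic numbers). Both groups share the same true conversion rate, yet stopping at the first “significant” daily reading declares a winner far more often than the nominal 5% level, while checking only at the planned end stays close to it:

```python
# Simulate many A/A tests: any "winner" is by definition a false positive.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(42)
true_rate = 0.05          # identical conversion rate in A and B
users_per_day = 1_000     # per group, assumed for illustration
days = 30
n_experiments = 1_000

false_positives_peeking = 0
false_positives_fixed = 0

for _ in range(n_experiments):
    conv = np.zeros(2, dtype=np.int64)   # cumulative conversions in A and B
    total = 0                            # cumulative users per group
    significant_while_peeking = False
    for day in range(days):
        conv += rng.binomial(users_per_day, true_rate, size=2)
        total += users_per_day
        table = [[conv[0], total - conv[0]],
                 [conv[1], total - conv[1]]]
        _, p, _, _ = chi2_contingency(table)
        if p < 0.05:
            significant_while_peeking = True   # a peeking experimenter stops here
    if significant_while_peeking:
        false_positives_peeking += 1
    if p < 0.05:                               # p from the planned final day only
        false_positives_fixed += 1

print(f"false positive rate with daily peeking: {false_positives_peeking / n_experiments:.1%}")
print(f"false positive rate at planned end:     {false_positives_fixed / n_experiments:.1%}")
```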
It is fine to have a peek at the data while the test is still running, but only to make sure everything is working correctly from a technical standpoint. If there is a bug or an infrastructure failure, we should stop the test, discard the results, correct the flaw, and restart from scratch. Under no circumstances should we tamper with the test procedure or timeframe and then ship the results as valid.
When we conduct a test, we normally expect ~50% of users in version A and ~50% in version B, and we should always check the actual observed split between the groups. If it differs significantly from the intended 50/50 proportion, this is a red flag indicating that the test results may be flawed. For instance, if version B takes just a few seconds longer to load, some users might quit the page without actually seeing it and without being tracked by the test logs. Ignoring these users may lead to wrong conclusions. Fixing such issues requires carefully going through all technical details of the test, identifying potential leak points, and adding checks to the test framework.
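One common way to run this check is a chi-square goodness-of-fit test against the intended 50/50 split. The sketch below uses scipy with hypothetical traffic counts; the 0.001 threshold is a convention often used for sample ratio mismatch alerts, not something prescribed here:

```python
from scipy.stats import chisquare

observed = [50_210, 48_890]            # users actually logged in A and B (made up)
expected = [sum(observed) / 2] * 2     # what a true 50/50 split would give

stat, p_value = chisquare(observed, f_exp=expected)
print(f"chi-square = {stat:.2f}, p = {p_value:.6f}")

if p_value < 0.001:
    print("Likely sample ratio mismatch: investigate before trusting the results.")
else:
    print("Observed split is consistent with the intended 50/50 allocation.")
```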
A very important point when restarting an A/B test is performing a ground-up restart, not continuing the test after a pause. For instance, we start an experiment, peek into the data, and observe an unexpected conversion drop in group B. We dive into the technical details and find a bug. We fix it, but then continue the test right from the point where we stopped: A users stay in version A and B users stay in version B after the fix. In the end, our results are most probably compromised, because some of the group B users who encountered the bug may have left the platform never to come back. Once again, we have a potential user leak point leading to a sample ratio mismatch and affecting the final result. In A/B testing, there is no place for a “pause/play” button. If a test is stopped and its conditions are altered, it should get a fresh start, with the previous results ruthlessly scrapped.
Failing to take the context of the test into account may also skew the sample and deliver misleading results. To illustrate this, let’s consider an imaginary “Darth Vader experiment”: version B has a picture of Darth Vader, while version A does not. In the test results, we see a notable click-through rate increase in group B and jump to the conclusion that version B works better. However, we did not take into account that the test ran at the same time as a new installment of the Star Wars franchise was released. The test results really reflected the buzz around the new Star Wars movie, not the superiority of version B.
In another example, we proudly display Darth Vader in version B of a website’s main page. Some people don’t like Darth Vader, so conversion in B drops. We restart the experiment, reshuffling A/B visitors to avoid the sticky experiment assignment problem. However, conversion keeps getting worse. We repeat the reshuffle-and-restart procedure multiple times. After enough iterations, only people who like Darth Vader or are indifferent to him remain on the platform, so the experiment finally becomes conversion-positive. But the result is unacceptable, because we have artificially narrowed our sample to a Darth-Vader-tolerant audience.
Despite being a statistical experiment, A/B testing is quite close to being a form of art. Its apparent simplicity is deceiving, as many variables must be taken into account. Being a professional in A/B testing requires seeing the big picture: all aspects of the product under test, technical issues that might affect how the different versions are perceived, the real-life context of the experiment, and many other factors that may have an unexpected influence on the results. Moreover, understanding human nature and taking a critical approach to interpreting data, no matter how good it looks at first sight, is a must.