As a product designer, you should know A/B testing because it replaces assumptions with quantifiable data and gives you actionable insights into user behavior, helping you improve usability and make user-centered design decisions.
A/B testing, also known as split testing, is a user experience research method where you split your audience to compare two variations, "A" and "B", of your product design and determine which is more effective. You can test a variety of elements with A/B testing, from images and layouts to interfaces, buttons, and more. It replaces guesswork with data-informed decisions.
But how exactly do you conduct effective A/B tests? And how do you analyze the data and convert it into actionable insights? Like any design endeavor, effective A/B testing should follow a clearly defined approach. This guide describes that approach and shows how you can use A/B testing to improve your product design.
The first step is to conduct research to understand the issues users have with your product design and what would appeal to them instead. A/B testing is informed by research, so analyze how users interact with your product, identify the aspect that most needs optimization, and use that data to determine the direction of your A/B test.
For example, suppose you find that only a few users are using a new feature you launched. The problem could be the navigation system, since poor navigation hurts usability.
So let's assume the problem is the navigation system, and that's the variable we are testing for.
Before carrying out your A/B test, you need a goal for the test, i.e., you have to set key performance indicators that will determine its success. In our case, what are we looking to improve? We want to increase the feature adoption rate by redesigning the navigation system.
So our objective here is to redesign the navigation system and check if the new variation increases the adoption rate.
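A quick note on the metric (the exact definition here is an assumption; use whatever your analytics tool reports): adoption rate is commonly measured as the percentage of active users who have used the feature at least once, i.e., adoption rate = (users who used the feature ÷ total active users) × 100.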
Before we formulate a hypothesis for our A/B test, let's understand what it means. Put simply, a hypothesis is a testable statement that predicts a suggested outcome. It is basically an assumption that can be verified by testing. From the example above, our hypothesis will be:
"If we redesign the navigation system, the feature adoption rate will increase"
However, in hypothesis testing we actually need two hypotheses: a null hypothesis and an alternative hypothesis.
The null hypothesis states that there is no relationship between the two variables being tested and that the outcome of the test is due to pure chance, i.e., there's no real effect behind the data our A/B test has produced, and the new variation doesn't make any difference. The null hypothesis in our A/B test example would be:
" A new navigation system will not increase the new feature's adoption rate"
But if the results provide enough evidence against the claim that the change had no effect, we can reject the null hypothesis in favor of the alternative hypothesis, which states the opposite:
"A new navigation system will increase the new feature's adoption rate"
Now we'll have to collect enough evidence from our A/B test to reject the null hypothesis and support the alternative hypothesis.
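One way to write this more precisely, using adoption rate as our metric: the null hypothesis says adoption rate (new navigation) = adoption rate (current navigation), while the alternative hypothesis says adoption rate (new navigation) > adoption rate (current navigation). The direction matters, because it determines whether you later run a one-sided or a two-sided significance test.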
The next step is to decide which users will participate in the test. In A/B testing we have two groups: the control group and the test group. The control group is shown the existing version of the navigation system throughout the A/B test, while the test group is shown the new variation.
You have to create the two groups by selecting users at random, a method called random sampling, to eliminate bias. You also have to choose a sample size before conducting the test to avoid undercoverage bias, i.e., the bias that occurs when you sample too few users. To work out how many users you need, use a sample size calculator from Optimizely, VWO, Google Optimize, or other A/B testing software.
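If you're curious what those calculators do under the hood, here's a minimal sketch in Python using statsmodels. The baseline rate, expected lift, significance level, and statistical power below are assumptions, so plug in your own numbers.

```python
# A rough sketch of a sample size calculation for comparing two proportions.
# All the input numbers here are assumptions; replace them with your own.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.15   # current adoption rate (control)
expected_rate = 0.20   # adoption rate we hope the new navigation reaches
alpha = 0.05           # significance level
power = 0.80           # probability of detecting the effect if it really exists

effect_size = proportion_effectsize(expected_rate, baseline_rate)
n_per_variation = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power, ratio=1.0
)
print(f"Users needed per variation: {round(n_per_variation)}")
```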
You can run the test manually by measuring the performance of the two variations daily (in our case, the daily adoption rate of each navigation system), or you can use any of the A/B testing software mentioned above to track performance automatically. This software also ensures that each variation is shown 50% of the time, which eliminates assignment bias.
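If you're splitting traffic yourself, a common approach is deterministic bucketing on a stable user ID, so a returning user always sees the same variation. Here's a minimal sketch; the experiment name and user ID are made up for illustration.

```python
# Deterministic 50/50 bucketing: hash a stable user ID so each user
# always lands in the same group across sessions.
import hashlib

def assign_variation(user_id: str, experiment: str = "nav-redesign") -> str:
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "control" if bucket < 50 else "test"

print(assign_variation("user-42"))  # same input, same group, every time
```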
The testing period might last several days or weeks, depending on how long it takes for each variation to be shown to the required number of users. Each user's interactions are then measured and analyzed to determine the performance of each variation. The longer the test runs, the more users it includes, which lowers the risk that the results are down to chance.
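As a rough, hypothetical estimate of duration: if the sample size calculator says you need about 450 users per variation and roughly 100 eligible users reach the tested screen each day, you'd plan to run the test for at least (2 × 450) ÷ 100 = 9 days, and it's common to round up to full weeks to smooth out day-of-week effects.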
This is the most crucial stage of the A/B test because it determines which variation performed better. Let's say we analyzed our A/B test example, and the results show that the control group has a 15% adoption rate while the test group has 45%.
Does this mean the new variation is doing better than the old one? Not necessarily; we can't draw conclusions without checking the result's validity. Regardless of whether an A/B test has a positive or negative outcome, you have to calculate the statistical significance of the test to prove that your results are reliable.
Statistical significance measures the certainty of the result obtained from an A/B test. The higher its value, the more you can trust that your result is not due to error or random chance.
To fully understand statistical significance, you have to be familiar with these terms:
The significance level, or α, is the probability of rejecting the null hypothesis when it is actually true, i.e., the probability of concluding there's an effect when the results occurred by chance. A common choice is 5% or less, which means there is at most a 5% probability that your results occurred by chance. We'll compare the significance level with the terms below to determine whether our result is statistically significant.
The confidence level measures the degree of certainty in an A/B test's result. It is equal to 1 minus the significance level (1 − α), so if your significance level is 5%, the corresponding confidence level is 95%. This means there's only a 5% probability that your results occurred by chance; in other words, you can be 95% confident that one variation really is better than the other.
The p-value is the probability of seeing a result at least as extreme as yours if the null hypothesis were true, i.e., if the difference were purely down to random chance. If your p-value is lower than your significance level (α), your results are statistically significant, and the lower the p-value, the stronger the evidence against the null hypothesis. A p-value greater than your significance level (e.g., 5%) means you cannot reject the null hypothesis, because your results could plausibly have occurred by chance. (p-value ≤ α = statistical significance) ✅
You can calculate your p-value with a two-proportion z-test, the standard test for comparing two conversion (or adoption) rates.
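Here's a minimal sketch of that calculation in Python using statsmodels, applied to our example; the sample sizes are hypothetical, so substitute the numbers from your own test.

```python
# Two-proportion z-test for our navigation example.
# The user counts below are hypothetical placeholders.
from statsmodels.stats.proportion import proportions_ztest

control_users, control_adopted = 1000, 150  # 15% adoption (existing navigation)
test_users, test_adopted = 1000, 450        # 45% adoption (new navigation)

# alternative="larger" matches our one-sided alternative hypothesis:
# the new navigation's adoption rate is higher than the current one's.
z_stat, p_value = proportions_ztest(
    count=[test_adopted, control_adopted],
    nobs=[test_users, control_users],
    alternative="larger",
)

alpha = 0.05  # significance level
print(f"z = {z_stat:.2f}, p-value = {p_value:.4f}")
print("Statistically significant" if p_value <= alpha else "Not significant")
```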
You can use either p-values or confidence levels to determine the statistical significance of your results, but a statistical significance calculator, such as the ones from Optimizely or Google Optimize, is much simpler and more efficient than doing the calculations manually. If your result is statistically significant, you can go ahead with the new variation and continue to improve your product design with A/B testing.
A/B testing provides clear data to support or debunk product design decisions, so you can ensure every change you ship produces positive results. It also helps you understand how specific elements of your design affect user behavior, so you can improve the areas that need optimization. The sooner you start testing, the sooner you can identify areas for improvement, and the better your product design will be.