AB Testing on Small Sample Sizes with Non-Normal Distributions

by Iaroslav Svet, September 18th, 2023

In this article, we will explore the intricacies of AB testing on small sample sizes, which can be valuable in B2B settings or products with a limited user base that still want to conduct experiments.

Many who have dealt with AB tests in large internet companies are accustomed to having a wealth of data, and the sample size for experiments can be enormous. Even if the key metric is not normally distributed, which often happens, it doesn't pose significant challenges. With large samples, you can even afford extra transformations that discard some information. For example, observations can be randomly grouped into buckets, and the average metric value within each bucket can be calculated. You can then work with the distribution of these averages, which tends to be close to normal due to the Central Limit Theorem.
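The bucketing idea can be illustrated with a minimal sketch. The data here is synthetic (a log-normal sample standing in for a skewed metric), and the function names `bucket_means` and `skewness` are made up for this illustration; the point is only that the bucket averages are far less skewed than the raw observations.

```python
import random
from statistics import fmean, pstdev

def bucket_means(values, n_buckets, seed=0):
    """Shuffle observations, split them into equal-sized buckets, return bucket means."""
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)
    size = len(shuffled) // n_buckets  # leftover observations are dropped
    return [fmean(shuffled[i * size:(i + 1) * size]) for i in range(n_buckets)]

def skewness(xs):
    """Population skewness: 0 for a symmetric distribution."""
    m, s = fmean(xs), pstdev(xs)
    return fmean([((x - m) / s) ** 3 for x in xs])

# Synthetic, heavily skewed data (an assumption, not real experiment data).
rng = random.Random(42)
raw = [rng.lognormvariate(0, 1.5) for _ in range(100_000)]
means = bucket_means(raw, n_buckets=200)

print("skewness of raw values:  ", round(skewness(raw), 2))
print("skewness of bucket means:", round(skewness(means), 2))
```

Note that 100,000 observations become only 200 data points, which is exactly the kind of luxury a small sample does not allow.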

The topic of experiments on big data has been well-studied recently, with many established approaches and best practices. However, experiments in B2B can significantly differ from what we're accustomed to. They bring us back to the world of small data, where experiments need to be conducted with a few thousand or even hundreds of observations, and their distribution may be even more skewed than in user data.

All the methods listed below are applicable to samples of any size, but they gain greater value in small samples.

Non-Parametric Tests

The non-normality of the distribution is, on its own, a relatively minor issue, because non-parametric tests can handle it. Instead of using Student's t-test, consider using the Mann-Whitney test, for example. If you haven't encountered non-parametric tests before, it's a good idea to get familiar with them sooner rather than later. Their advantage is that they don't assume a particular shape of the data distribution, so they can be used when the distribution is not normal or when the data contains outliers. Keep in mind that they answer a slightly different question: the Mann-Whitney test checks whether values in one group tend to be larger than in the other, not whether the means differ.

Many non-parametric tests, including Mann-Whitney, are based on data ranks rather than values. The pooled data is ranked, and the test statistic is then computed from the ranks that fall into each sample. There are plenty of resources on this topic, ranging from very basic to more scientific. For example, one of the most cited works on this topic was published in 1957: Link to the paper, and more accessible descriptions can be found on the web.

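It's worth noting that the Mann-Whitney test doesn't work well with ties, meaning a large number of repeated values in the sample. Tied observations share the same rank, so the ranks carry less information, the exact p-value no longer applies, and a tie-corrected approximation has to be used instead. Check how common ties are in your data before relying on the test.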
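To make the rank mechanics and the tie check concrete, here is a minimal standard-library sketch. It uses the tie-corrected normal approximation without a continuity correction, which is an assumption of this example and is rough for very small groups; the helper names and sample values are made up. In practice a library routine such as `scipy.stats.mannwhitneyu` is the more sensible choice.

```python
import math
from collections import Counter
from statistics import NormalDist

def midranks(values):
    """Rank values from 1..n; tied values get the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney test; returns (U for sample a, approximate p-value)."""
    n1, n2 = len(a), len(b)
    n = n1 + n2
    pooled = list(a) + list(b)
    ranks = midranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    # Variance of U with the standard correction for tied groups of size t.
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # every value identical: no evidence of a difference
        return u1, 1.0
    z = (u1 - n1 * n2 / 2) / math.sqrt(var_u)
    return u1, 2 * (1 - NormalDist().cdf(abs(z)))

def tie_share(a, b):
    """Fraction of pooled observations whose value occurs more than once."""
    pooled = list(a) + list(b)
    counts = Counter(pooled)
    return sum(c for c in counts.values() if c > 1) / len(pooled)

# Made-up skewed samples for groups A and B.
group_a = [120, 85, 40, 300, 95, 60, 2500, 110]
group_b = [150, 130, 90, 410, 160, 75, 3100, 140]
u, p = mann_whitney_u(group_a, group_b)
print("U =", u, "p ≈", round(p, 3), "tie share =", tie_share(group_a, group_b))
```

A high tie share is a signal to treat the resulting p-value with caution or to look at a different test.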

Outliers

Any modification of the sample is a contentious decision that should be made with care, especially with small samples. Removing even a few observations can significantly change the results, and there's no guarantee that the change strengthens the signal rather than weakens it. However, if you've decided that something needs to be done with outliers, consider alternatives to outright removal. One such technique is Winsorization, which keeps every observation but replaces the values beyond chosen cutoffs (for example, the top and bottom 5%) with the nearest value that remains inside those cutoffs: Link to Winsorization.
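As a rough illustration, Winsorization can be sketched in a few lines. The cutoff convention here (clamp the `int(frac * n)` most extreme values on each side) is one common choice and an assumption of this example, as are the function name and the sample values; `scipy.stats.mstats.winsorize` offers a ready-made version.

```python
def winsorize(values, lower_frac=0.05, upper_frac=0.05):
    """Clamp the lowest/highest fractions of values to the nearest retained value."""
    ordered = sorted(values)
    n = len(ordered)
    k_low = int(lower_frac * n)    # how many smallest values to clamp
    k_high = int(upper_frac * n)   # how many largest values to clamp
    if k_low + k_high >= n:
        raise ValueError("fractions leave no observations untouched")
    lo = ordered[k_low]
    hi = ordered[n - 1 - k_high]
    return [min(max(v, lo), hi) for v in values]

# Made-up metric values with one extreme observation.
values = [12, 15, 11, 14, 13, 16, 10, 17, 18, 900]
print(winsorize(values, lower_frac=0.0, upper_frac=0.1))
```

The sample size stays the same, which matters when every observation counts; only the influence of the extreme value is capped.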

Variance Reduction

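Additionally, it may be appropriate to apply one of the accepted methods for reducing variance. A good example of such a technique is CUPED, which emerged at Microsoft a decade ago. It uses pre-experiment data, usually the same metric measured before the test, to remove the part of the variance that was already predictable before the experiment started. In its simplest form, the main idea is to compare the change in a metric before and after instead of the metric itself for groups A and B; the general version scales the pre-experiment value by a coefficient estimated from the data. You can find a canonical work on this topic here: Link to the paper.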
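A minimal sketch of the adjustment, assuming the commonly used form `adjusted = post - theta * (pre - mean(pre))` with `theta = cov(pre, post) / var(pre)` estimated on both groups pooled together. The data is synthetic and all names are made up for illustration. With `theta = 1` this reduces to the simple "change before and after" comparison.

```python
import random
from statistics import fmean, pvariance

def cuped_adjust(post, pre):
    """Return (adjusted values, theta) using the pre-period metric as covariate.

    Assumes pre is not constant; otherwise theta is undefined.
    """
    mean_pre, mean_post = fmean(pre), fmean(post)
    cov = sum((x - mean_pre) * (y - mean_post) for x, y in zip(pre, post))
    var = sum((x - mean_pre) ** 2 for x in pre)
    theta = cov / var
    adjusted = [y - theta * (x - mean_pre) for x, y in zip(pre, post)]
    return adjusted, theta

# Synthetic data: the post-period metric is strongly related to the pre-period one,
# and group B receives a small additive effect.
rng = random.Random(7)
pre = [rng.lognormvariate(3, 1) for _ in range(400)]
group = [rng.choice("AB") for _ in pre]
post = [x * rng.uniform(0.8, 1.2) + (2.0 if g == "B" else 0.0)
        for x, g in zip(pre, group)]

adjusted, theta = cuped_adjust(post, pre)

def group_mean(values, label):
    return fmean([v for v, g in zip(values, group) if g == label])

print("variance before:", round(pvariance(post), 1))
print("variance after: ", round(pvariance(adjusted), 1))
print("B - A on adjusted metric:",
      round(group_mean(adjusted, "B") - group_mean(adjusted, "A"), 2))
```

The adjusted metric keeps the same overall mean but has lower variance, so the same test applied to it has more power to detect the difference between groups.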

Formation of A and B Groups

All the methods mentioned above are valuable for improving the quality of results, but in the case of small samples, they may not help if you don't pay sufficient attention to the logic of how A and B groups are allocated early on. The usual random allocation to groups, which is canonical for tests on large data, can be detrimental in our case. For example, we may doom an experiment to failure by putting the two largest companies in the same group.

The solution to this problem is quite simple: random pairwise allocation. Sort the list from largest to smallest, take neighbors in pairs, and within each pair randomly assign one to group A and the other to group B. It might look something like this:

1000 - A

930 - B

800 - B

730 - A

520 - B

511 - A
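The allocation above can be sketched as follows. The company names and the function name are hypothetical; the sizes are the ones from the example. If the number of units is odd, this sketch assigns the last one at random, which is an assumption rather than a rule.

```python
import random

def pairwise_allocate(sizes, seed=None):
    """Sort units by size, pair up neighbors, and randomly split each pair between A and B.

    sizes: dict mapping unit name -> size metric (e.g. revenue).
    Returns a dict mapping unit name -> "A" or "B".
    """
    rng = random.Random(seed)
    ordered = sorted(sizes, key=sizes.get, reverse=True)
    assignment = {}
    for i in range(0, len(ordered) - 1, 2):
        labels = ["A", "B"]
        rng.shuffle(labels)  # random order within the pair
        assignment[ordered[i]], assignment[ordered[i + 1]] = labels
    if len(ordered) % 2 == 1:  # unpaired smallest unit
        assignment[ordered[-1]] = rng.choice(["A", "B"])
    return assignment

# Hypothetical companies with the sizes used in the example above.
sizes = {"c1": 1000, "c2": 930, "c3": 800, "c4": 730, "c5": 520, "c6": 511}
allocation = pairwise_allocate(sizes, seed=1)
for name in sorted(sizes, key=sizes.get, reverse=True):
    print(sizes[name], "-", allocation[name])
```

Each pair contributes exactly one unit to each group, so the two largest companies can never end up on the same side.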

In Conclusion

All the above should help you conduct experiments with small data, but the main recommendation is the same for all experiments: first, make sure the experiment is necessary in the first place, and only then delve into all the mathematical and quasi-mathematical magic.