Statistical inference is a powerful tool for drawing conclusions and making predictions about populations based on sample data. It allows us to make informed decisions and understand the effectiveness of different options. One popular application of statistical inference is A/B testing, where we compare two versions or treatments to determine the superior performer. But what happens when we introduce more versions or treatments to the experiment?

It may seem that additional versions in an experiment are an opportunity for even better decisions. Unfortunately, if not handled properly, the increased number of testable hypotheses can lead to misleading results and incorrect decisions. This challenge is known as the multiple comparisons problem.

In this article, I explain the concept of multiple hypothesis testing, its potential pitfalls, and give one possible solution, supported by a Python simulation.

What is multiple hypothesis testing?

To understand multiple hypothesis testing, let's begin by examining the fundamental concepts of a simple A/B test involving two variants.

In an A/B test, we start by formulating two competing hypotheses: the null hypothesis, which represents the absence of a difference between the variants, and the alternative hypothesis, which suggests the presence of a difference.

Then we set a significance level, denoted as alpha. This threshold determines the amount of evidence required to reject the null hypothesis: it is the probability of rejecting the null hypothesis when it is actually true that we are willing to accept. Commonly used significance levels are 0.05 (5%) and 0.01 (1%).

After running the experiment and collecting the data, we calculate p-values. The p-value is the probability of obtaining a result as extreme as, or more extreme than, the observed data if the null hypothesis were true. If the p-value is less than the significance level alpha, we reject the null hypothesis in favor of the alternative hypothesis.

It is important to note that a low p-value suggests strong evidence against the null hypothesis, indicating that the observed data is unlikely to occur by chance alone. However, this does not imply certainty: there remains a non-zero probability of observing a difference between the samples even when the null hypothesis is true.
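To make the procedure concrete, here is a minimal sketch of a single two-variant A/B test on conversion rates using a pooled two-proportion z-test. The conversion counts and the choice of test are illustrative assumptions of mine, not part of this article's simulation.

import numpy as np
from scipy.stats import norm

# Hypothetical conversion data for variants A and B (made-up numbers)
conversions = np.array([180, 210])  # converted users per variant
visitors = np.array([2000, 2000])   # users exposed to each variant
alpha = 0.05                        # significance level chosen before the experiment

# Pooled two-proportion z-test
# H0: both variants convert at the same rate; H1: the rates differ
rates = conversions / visitors
pooled_rate = conversions.sum() / visitors.sum()
standard_error = np.sqrt(pooled_rate * (1 - pooled_rate) * (1 / visitors[0] + 1 / visitors[1]))
z_stat = (rates[1] - rates[0]) / standard_error

# Two-sided p-value: probability of a result at least this extreme under H0
p_value = 2 * norm.sf(abs(z_stat))

print(f'p-value = {p_value:.4f}')
print('Reject H0' if p_value < alpha else 'Fail to reject H0')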
When we encounter a situation with multiple alternative hypotheses, we refer to it as multiple hypothesis testing. In such cases, the complexity increases, and we need to carefully consider the impact of conducting multiple tests simultaneously.

Pitfalls of multiple hypothesis testing

The pitfall of multiple hypothesis testing arises when we test multiple hypotheses without adjusting the significance level alpha. In such cases, we inadvertently inflate the rate of "Type I" errors, meaning that we tend to reject a null hypothesis (find a difference) even though that null hypothesis is in fact true (there is no difference at all). The more hypotheses we test simultaneously, the higher the chance of finding a p-value lower than alpha for at least one hypothesis and erroneously concluding a significant difference.

To illustrate this issue, consider a scenario where we want to test N hypotheses to determine which of multiple new webpage designs attracts more customers, with a desired alpha = 0.05. Let's assume we know that none of the new designs is better than the default one, meaning the null hypothesis holds in all N cases.

However, for each case there is a 5% probability (given that the null hypothesis is true) of committing a "Type I" error, or a false positive, and a 95% probability of correctly not rejecting the null hypothesis. Theoretically, the probability of having at least one false positive among the N tests is 1 - (1 - alpha)^N = 1 - 0.95^N. For instance, when N = 10, this probability is approximately 40%, significantly higher than the initial 5%.

The problem becomes more apparent as we increase the number of hypotheses tested. Even in scenarios where only a few variants and subsamples are involved, the number of comparisons can quickly accumulate. For example, comparing three designs D1, D2, and D3 for all users, then separately for users in country C1, and again for users in countries other than C1, results in a total of nine comparisons. It is easy to unwittingly engage in multiple comparisons without realizing the subsequent inflation of Type I error rates.
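Since this section reasons directly from the formula 1 - (1 - alpha)^N, here is a tiny snippet (not part of the article's main simulation) that evaluates it for a few values of N:

# Probability of at least one false positive among N independent comparisons,
# assuming every null hypothesis is true and no correction is applied
alpha = 0.05
for n in (1, 3, 5, 10, 20):
    prob_at_least_one = 1 - (1 - alpha) ** n
    print(f'N = {n:2d}: P(at least one false positive) = {prob_at_least_one:.2%}')

Even the nine comparisons from the design example above already push this probability to roughly 1 - 0.95^9 ≈ 37%.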
The resurrection of a salmon using multiple hypothesis testing

Let's delve into an intriguing example that highlights the consequences of not controlling "Type I" errors when testing multiple hypotheses.

In 2009, a group of researchers conducted an fMRI scan on a dead Atlantic salmon and astonishingly discovered brain activity, as if it were alive! To understand how this unexpected result came about, we need to explore the nature of fMRI scans.

fMRI scanners serve as extensive experimental platforms, where numerous tests are conducted on each patient. The scanners monitor changes in blood oxygenation as an indicator of brain activity. Researchers typically focus on specific regions of interest, so they partition the entire scanned volume into small cubes called voxels. Each voxel represents a hypothesis test for the presence of brain activity within that particular cube. Because high-resolution scans are desirable, fMRI scanners end up evaluating thousands of hypotheses during a single procedure.

In this case, the researchers intentionally did not correct the initial significance level of alpha = 0.001, and they identified three voxels with "brain activity" in the deceased salmon. This outcome contradicts our understanding that the salmon is indeed dead. Upon correcting the significance level, the resurrection-like phenomenon disappeared, illustrating the importance of addressing Type I errors.

Whenever multiple hypotheses are tested within a single experiment, it becomes crucial to control Type I errors. Various statistical techniques, such as the Bonferroni correction, can be employed to mitigate the inflated false positive rate associated with multiple hypothesis testing.

Bonferroni correction

The Bonferroni correction is a statistical procedure specifically designed to address the challenge of multiple comparisons during hypothesis testing.

Theory

Suppose you need to test N hypotheses during an experiment while ensuring the probability of a Type I error remains below alpha.

The underlying idea of the procedure is straightforward: reduce the significance level required to reject the null hypothesis for each individual alternative hypothesis. Remember the formula for the probability of at least one false positive? It contains alpha, which can be decreased to diminish the overall probability.

So, to achieve a lower probability of at least one false positive among multiple hypotheses, you can compare each p-value not with alpha but with something smaller.

But what exactly is "something smaller"? It turns out that using bonferroni_alpha = alpha / N as the significance level for each individual hypothesis ensures that the overall probability of a Type I error remains below alpha.

For example, if you are testing 10 hypotheses (N = 10) and the desired significance level is 5% (alpha = 0.05), you should compare each individual p-value with bonferroni_alpha = alpha / N = 0.05 / 10 = 0.005. By doing so, the probability of erroneously rejecting at least one true null hypothesis will not exceed the desired level of 0.05.

This technique works due to Boole's inequality, which states that the probability of a union of events is less than or equal to the sum of their individual probabilities. While a formal mathematical proof exists, the intuition is simple: when each individual hypothesis is tested at significance level bonferroni_alpha = alpha / N, each test has a bonferroni_alpha probability of a false positive. For N such tests, the probability of the union of the "false positive" events is less than or equal to the sum of the individual probabilities, which, in the worst-case scenario where the null hypothesis holds in all N tests, equals N * bonferroni_alpha = N * (alpha / N) = alpha.
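In code, the correction amounts to shrinking the threshold each p-value is compared against. Here is a minimal sketch with made-up p-values (not taken from the article's simulation); in practice the same procedure is also available in libraries, for example statsmodels.stats.multitest.multipletests with method='bonferroni'.

import numpy as np

# Five hypothetical p-values from five comparisons (illustrative values)
p_values = np.array([0.001, 0.020, 0.049, 0.300, 0.750])
alpha = 0.05
bonferroni_alpha = alpha / len(p_values)  # 0.05 / 5 = 0.01

rejected_naive = p_values < alpha                   # compare against the raw alpha
rejected_bonferroni = p_values < bonferroni_alpha   # compare against alpha / N

print('Without correction:', rejected_naive)        # three rejections
print('With Bonferroni:   ', rejected_bonferroni)   # only the strongest one survives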
Practice

To further support the concepts discussed above, let's run a simulation in Python and assess the effectiveness of the Bonferroni correction.

Consider a scenario with 10 alternative hypotheses within a single experiment, and assume that in all 10 cases the null hypothesis is true. We agree that a significance level of alpha = 0.05 is appropriate for the analysis.

Without any correction for the inflated Type I error rate, the theoretical probability of at least one false positive is approximately 40%. After applying the Bonferroni correction, we expect this probability not to exceed 5%.

For a specific experiment, we either get at least one false positive or we don't; these probabilities only become visible across many experiments. So let's run 100 such experiments and count how many of them contain at least one false positive (a p-value below the significance level).

You can find the file with all the code needed to run this simulation and generate the graphs in my GitHub repository: IgorKhomyanin/blog/bonferroni-and-salmon.ipynb

import numpy as np
import matplotlib.pyplot as plt

# To replicate the results
np.random.seed(20000606)

# Some hyperparameters to play with
N_COMPARISONS = 10
N_EXPERIMENTS = 100

# Sample p-values
# As we assume that the null hypothesis is true,
# the p-values are distributed uniformly
sample = np.random.uniform(0, 1, size=(N_COMPARISONS, N_EXPERIMENTS))

# Probability of a Type I error we are ready to accept:
# the probability of rejecting the null hypothesis when it is actually true
alpha = 0.05

# Theoretical False Positive Rate
#
# 1.
# The probability of concluding a significant difference for a given comparison
# is equal to alpha by definition in our setting of a true null hypothesis.
# Then (1 - alpha) is the probability of not rejecting the null hypothesis.
#
# 2.
# As comparisons are considered independent, the probability of not rejecting
# the null hypothesis in all N comparisons is (1 - alpha)^N.
#
# 3.
# The probability that at least one comparison is a false positive is
# 1 - (probability from 2.)
prob_at_least_one_false_positive = 1 - ((1 - alpha) ** N_COMPARISONS)

# Observed False Positive Rate
# An experiment is counted as a false positive when at least one p-value is below alpha
false_positives_cnt = np.sum(np.sum(sample <= alpha, axis=0) > 0)
false_positives_share = false_positives_cnt / N_EXPERIMENTS

# Bonferroni correction
bonferroni_alpha = alpha / N_COMPARISONS
bonferroni_false_positive_comparisons_cnt = np.sum(np.sum(sample <= bonferroni_alpha, axis=0) > 0)
bonferroni_false_positive_comparisons_share = bonferroni_false_positive_comparisons_cnt / N_EXPERIMENTS

print(f'Theoretical False Positive Rate Without Correction: {prob_at_least_one_false_positive:0.4f}')
print(f'Observed False Positive Rate Without Correction: {false_positives_share:0.4f} ({false_positives_cnt:0.0f} out of {N_EXPERIMENTS})')
print(f'Observed False Positive Rate With Bonferroni Correction: {bonferroni_false_positive_comparisons_share:0.4f} ({bonferroni_false_positive_comparisons_cnt:0.0f} out of {N_EXPERIMENTS})')

# Output:
# Theoretical False Positive Rate Without Correction: 0.4013
# Observed False Positive Rate Without Correction: 0.4200 (42 out of 100)
# Observed False Positive Rate With Bonferroni Correction: 0.0300 (3 out of 100)

Here is a visualization of the results:

In the top picture, each square represents the p-value of an individual comparison (hypothesis test); the darker the square, the higher the p-value. Since we know that the null hypothesis holds in all cases, any significant result is a false positive. The middle graph shows the experiments without correction, while the bottom graph shows the experiments with the Bonferroni correction. When a p-value is lower than the significance level (an almost white square), we reject the null hypothesis and obtain a false positive; experiments with at least one false positive are colored red.

Clearly, the correction worked. Without it, we observe 42 experiments out of 100 with at least one false positive, which closely aligns with the theoretical ~40% probability. With the Bonferroni correction, only 3 experiments out of 100 contain at least one false positive, staying well below the desired 5% threshold. Through this simulation, we can visually observe how the Bonferroni correction mitigates false positives, further validating its usefulness in multiple hypothesis testing.

Conclusion

In this article, I explained the concept of multiple hypothesis testing and highlighted its potential danger. When testing multiple hypotheses, such as conducting multiple comparisons during an A/B test, the probability of observing a rare "false positive" event increases. If not appropriately addressed, this heightened probability can lead to erroneous conclusions about significant effects that in fact occurred by chance.

One possible solution to this issue is the Bonferroni correction, which adjusts the significance level for each individual hypothesis. By leveraging Boole's inequality, this correction keeps the overall significance level at the desired threshold, reducing the risk of false positives.

However, it's important to recognize that every solution comes at a cost. When using the Bonferroni correction, the required significance level for each test decreases, which reduces the power of the experiment. This means that a larger sample size or stronger effects may be necessary to detect the same differences. Researchers must carefully weigh this trade-off when deciding whether to apply the Bonferroni correction or another correction method.
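As a rough illustration of this cost, here is a small simulation of my own (not from the article's notebook) in which a real difference of 0.3 standard deviations exists between two groups. It compares how often a two-sample t-test detects the effect at the raw threshold alpha versus the Bonferroni threshold alpha / 10; the effect size, sample size, and number of comparisons are assumptions chosen purely for illustration.

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)

alpha = 0.05
n_comparisons = 10
bonferroni_alpha = alpha / n_comparisons  # 0.005
effect, n_obs, n_runs = 0.3, 200, 1000    # true effect in SD units, group size, repetitions

detected_raw = 0
detected_bonferroni = 0
for _ in range(n_runs):
    control = rng.normal(0.0, 1.0, n_obs)
    treatment = rng.normal(effect, 1.0, n_obs)
    p_value = ttest_ind(control, treatment).pvalue
    detected_raw += p_value < alpha
    detected_bonferroni += p_value < bonferroni_alpha

# Share of runs in which the real effect is detected (statistical power)
print(f'Power at alpha = {alpha}: {detected_raw / n_runs:.2f}')
print(f'Power at bonferroni_alpha = {bonferroni_alpha}: {detected_bonferroni / n_runs:.2f}')

Under these assumptions, the detection rate at the Bonferroni threshold comes out noticeably lower than at the raw threshold, which is exactly the sample-size cost described above.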
If you have any questions or comments about this article, please don't hesitate to share them. Further discussion and exploration of statistical techniques is essential for better understanding and application in practice.

References

- IgorKhomyanin/blog/bonferroni-and-salmon - my GitHub repository with all the code needed to run the simulation and generate the graphs
- Kandinsky 2.1 - used to generate the cover picture
- Bennett, CM, MB Miller, and GL Wolford. "Neural Correlates of Interspecies Perspective Taking in the Post-Mortem Atlantic Salmon: An Argument for Multiple Comparisons Correction." NeuroImage 47 (July 2009): S125. https://doi.org/10.1016/s1053-8119(09)71202-9
- Wikipedia - Bonferroni correction
- Wikipedia - Boole's inequality