
Navigating the Maze of Multiple Hypotheses Testing—Part 1: Essential Jargon and Common Solutions

by Viktoria Barsukova, March 18th, 2024

Too Long; Didn't Read

Explore the multiple comparisons problem in statistics through a humorous lens, employing Python to elucidate concepts like the Bonferroni and Benjamini-Hochberg corrections, null hypotheses, and Type I errors. With entertaining party metaphors, learn how to balance risk and reward in statistical analysis, ensuring more reliable conclusions in data exploration.


Hi fellows!

In this two-part article, I would like to focus on a common problem in statistics - multiple comparisons.

In the first part, we will dive into the main terminology of this problem and the most common solutions. In the second part, we will explore practical implementation with Python code and interpret the results.

I will use metaphors to aid immersion in the topic and make it more fun.

Let's get started! 😎

The Multiple Comparisons Problem in a Nutshell

Imagine you arrive at a party where everyone is wearing a mask, and you are trying to guess whether a celebrity is hiding behind each one. The more guesses you make, the more likely you are to be wrong at least once (hello, Type I errors!). This is the core difficulty of the multiple comparisons problem in statistics: every additional hypothesis you test increases the chance of at least one false positive.

Essential Jargon for the Party

Null Hypothesis (H0): The null hypothesis is your baseline assumption that a particular guest is just a regular visitor, not a hidden celebrity. But at a party there are many guests around, so we have to make this assumption many times. This is how multiple hypothesis testing arises.
Type I Error: A Type I error is when you identify a guest as a celebrity, but it turns out they are not. In the language of statistics, it means that we wrongly reject the null hypothesis, thinking that we have detected a real difference when there isn't one.
Family-Wise Error Rate (FWER): FWER is the probability of making one or more false discoveries (Type I errors) when performing multiple hypothesis tests. In other words, it is the measure to watch when we are afraid of making even one mistake among all our assumptions (tests). For example, let's say we're testing 10 independent hypotheses, each with a Type I error rate of 0.05. Then our family-wise error rate is 1 - (1 - 0.05)^10 ≈ 0.40 (40%, Karl!), and it can never exceed the simple bound 0.05 * 10 = 0.5. We don't want to take such a risk, which is why we need to control the probability of mistakes somehow. The Bonferroni correction comes to help us (but more on that later).
False Discovery Rate (FDR): In statistics, FDR is the expected proportion of "discoveries" (rejected null hypotheses) that are false (incorrect rejections of the null). So, we are still at the party, but we've already had a glass of sparkling wine and become more willing to take risks. This means that we are not afraid of making an occasional mistake, because we would like to catch as many real celebrities as possible. Of course, we would still like to be right in a large proportion of our guesses, and here FDR-controlling procedures like the Benjamini-Hochberg correction come to help us (but more on that later).
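
To make the FWER numbers concrete, here is a minimal Python sketch. It assumes the tests are independent, in which case the probability of at least one Type I error is 1 - (1 - α)^n; the product α · n is only an upper bound on that probability. The function names are illustrative and not taken from any library.

```python
def family_wise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one Type I error across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


def fwer_upper_bound(alpha: float, n_tests: int) -> float:
    """Simple upper bound alpha * n (capped at 1); holds without independence."""
    return min(1.0, alpha * n_tests)


if __name__ == "__main__":
    # Show how quickly the risk grows with the number of guests we guess about.
    for n in (1, 5, 10, 20):
        exact = family_wise_error_rate(0.05, n)
        bound = fwer_upper_bound(0.05, n)
        print(f"n={n:2d}  FWER={exact:.3f}  upper bound={bound:.3f}")
```

With 10 tests at α = 0.05 the FWER under independence is about 0.40, and it keeps climbing as more tests are added.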

FWER: Bonferroni Correction

As I mentioned above, the Bonferroni correction is designed for those who are afraid of making even one mistake. It demands that you be extra sure about each discovery when you are looking at many possibilities at once. How does it do this? It simply makes the criterion for significance stricter, so you are less likely to pick the "wrong" celebrity.

Let's return to our previous example with 10 hypotheses. For each finding to be considered significant, it must meet a much stricter standard. If you are testing 10 hypotheses and your usual significance level is 0.05, Bonferroni adjusts this to 0.05 / 10 = 0.005 for each test.

Formula:
Adjusted significance level = α / n
- α is your initial significance level (usually 0.05)
- n is the number of hypotheses you are testing
Impact:
This method greatly reduces the chance of false discoveries (Type I errors) by setting the bar higher for what counts as a significant result. However, its strictness can also prevent you from recognizing true findings, just as you might fail to recognize a celebrity because you are too focused on not making a mistake.

In essence, the Bonferroni correction prioritizes avoiding false positives at the risk of missing out on true discoveries, making it a conservative choice in hypothesis testing.
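
As a rough illustration of the formula above, the following Python sketch applies the Bonferroni threshold α / n to a list of p-values. The p-values are made up for the example, and the function name is hypothetical rather than part of any library.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return one True/False flag per p-value: True means H0 is rejected."""
    n = len(p_values)
    threshold = alpha / n  # adjusted significance level
    return [p <= threshold for p in p_values]


if __name__ == "__main__":
    # Hypothetical p-values from 10 tests (illustrative only).
    p_values = [0.001, 0.004, 0.012, 0.018, 0.03, 0.045, 0.2, 0.35, 0.6, 0.9]

    uncorrected = sum(p <= 0.05 for p in p_values)
    corrected = sum(bonferroni_reject(p_values))
    print("significant without correction:", uncorrected)
    print("significant after Bonferroni:  ", corrected)
```

For these made-up values, six results pass the uncorrected 0.05 cutoff, but only two survive the adjusted cutoff of 0.005, which shows how conservative the correction is.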

FDR: The Benjamini-Hochberg Correction

As we have already discussed, the Benjamini-Hochberg correction is like a more risk-tolerant guest who lets you confidently identify celebrities without being too strict.

This method adjusts the significance levels based on the rank of each p-value, controlling FDR. This approach allows more flexibility compared to the Bonferroni correction.

The Process:

Rank P-values: From the smallest to the largest.
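Adjust Significance Levels: Each hypothesis gets its own threshold, (rank / n) * α, based on its rank and the total number of tests n. The threshold is strictest for the smallest p-value and becomes more lenient as the rank increases. You then find the largest rank whose p-value is at or below its threshold and reject every hypothesis up to and including that rank (more details can be found in the next part of this article).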
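
The two steps above can be sketched in plain Python as follows. This is a minimal illustration of the standard Benjamini-Hochberg step-up procedure with rank-based thresholds (rank / n) * α, not the implementation from the second part of this article; the p-values are made up and the function name is hypothetical.

```python
def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Return one True/False flag per p-value: True means H0 is rejected."""
    n = len(p_values)
    # Step 1: rank p-values from smallest to largest (keep original positions).
    order = sorted(range(n), key=lambda i: p_values[i])

    # Step 2: find the largest rank k with p_(k) <= (k / n) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / n * alpha:
            max_rank = rank

    # Reject every hypothesis ranked at or below that largest rank.
    reject = [False] * n
    for idx in order[:max_rank]:
        reject[idx] = True
    return reject


if __name__ == "__main__":
    # Hypothetical p-values from 10 tests (illustrative only).
    p_values = [0.001, 0.004, 0.012, 0.018, 0.03, 0.045, 0.2, 0.35, 0.6, 0.9]
    flags = benjamini_hochberg_reject(p_values)
    print("rejected:", [p for p, r in zip(p_values, flags) if r])
```

For these made-up values the procedure rejects the four smallest p-values, whereas a Bonferroni cutoff of 0.05 / 10 = 0.005 would reject only the first two.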

So, by focusing on controlling the FDR, the Benjamini-Hochberg correction allows you to find more celebrities among all the guests at the party. This approach is particularly useful when you test a large number of hypotheses and are willing to accept some proportion of mistakes in order not to miss out on important findings.

In summary, the Benjamini-Hochberg correction offers a practical balance between discovering true effects and controlling the rate of false positives.

In conclusion, we discussed the main terminology of the multiple comparisons problem and the most common ways to deal with it. In the next part, I will focus on practical implementation with Python code and interpretation of the results.

See you!