Balancing Type I and Type II Errors in A/B Testing Decisions

by AB Test, March 30th, 2025

Too Long; Didn't Read

A/B testing decisions rely on statistical error control. This section explores superiority and non-inferiority tests, Bonferroni adjustments, and how multiple-testing corrections shape experiment outcomes.


Abstract and 1 Introduction

1.1 Related literature

  2. Types of Metrics and Their Hypothesis and 2.1 Types of metrics

    2.2 Hypotheses for different types of metrics

  3. Type I and Type II Error Rates for Decision Rules including Superiority and Non-Inferiority Tests

    3.1 The composite hypotheses of superiority and non-inferiority tests

    3.2 Bounding the type I and type II error rates for UI and IU testing

    3.3 Bounding the error rates for a decision rule including both success and guardrail metrics

    3.4 Power corrections for non-inferiority testing

  4. Extending the Decision Rule with Deterioration and Quality Metrics

  5. Monte Carlo Simulation Study

    5.1 Results

  6. Discussion and Conclusions


APPENDIX A: IMPROVING THE EFFICIENCY OF PROPOSITION 4.1 WITH ADDITIONAL ASSUMPTIONS

APPENDIX B: EXAMPLES OF GLOBAL FALSE AND TRUE POSITIVE RATES

APPENDIX C: A NOTE ON SEQUENTIAL TESTING OF DETERIORATION

APPENDIX D: USING NYHOLT’S METHOD OF EFFICIENT NUMBER OF INDEPENDENT TESTS


Acknowledgments and References

3. TYPE I AND TYPE II ERROR RATES FOR DECISION RULES INCLUDING SUPERIORITY AND NON-INFERIORITY TESTS

The categorization of metrics into success, guardrail, deterioration and quality metrics naturally paves the way for a decision rule that combines the results of the metrics in each category appropriately. The decision rule, in turn, implies various multiple-testing corrections for the statistical inference so that both the type I and type II error rates of the shipping decision are controlled as intended. In this section, we start by establishing some fundamental results for superiority and non-inferiority testing and, more generally, for the union-intersection and intersection-union testing that these rely on. These results are then used to construct a decision rule that includes superiority and non-inferiority tests. In subsequent sections, we include deterioration tests and finally quality metrics.


TABLE 2. Hypotheses for the three main types of metrics considered in the decision rules, where δ is the estimand of interest, such as the average treatment effect. The hypotheses of the quality tests are left out because they typically do not use the same kind of estimands as the others, so their hypotheses vary as a consequence. The minimum detectable effect (MDE) is the effect size used for success metrics when designing the experiment. The non-inferiority margin (NIM) is the tolerance level of regression used in the non-inferiority tests for guardrail metrics. Status quo is not a hypothesis in the traditional sense, but a scenario of interest later in the paper.


The goal of the designs in this paper is to bound the family-wise error rates of the decision. In the following, we focus our discussion on Bonferroni-based adjustments for multiple comparisons that let us use the results of individual tests to evaluate the joint global hypothesis. We choose this approach for two main reasons:


  1. Interpretability is greatly simplified when experimenters can view individual results for metrics (including confidence intervals).


  2. By evaluating a global hypothesis consisting of individual hypotheses for multiple metrics through individual tests, we can fit our framework into a large, scalable experimentation platform where a decision rule approach fits seamlessly into other core experimentation functionality like sequential testing, variance reduction, and more.


While other multiple-testing adjustment procedures are often more powerful, they generally achieve that power at the expense of interpretability and ease of understanding. For example, Holm’s multiple-testing correction method [7] is generally more powerful than Bonferroni’s, but the associated confidence intervals are more complicated and do not always yield finite bounds [6]. Even when confidence intervals are available, explaining the correction and how it affects the intervals to experimenters is nontrivial.
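To make the trade-off concrete, the following is a minimal Python sketch, not the authors' implementation. The function names, the example p-values, and the normal-approximation interval are assumptions made for illustration. It shows that a Bonferroni decision and its confidence interval depend only on the metric's own result and the number of tests, whereas Holm's step-down decision for one metric depends on the p-values of all the others.

```python
from statistics import NormalDist


def bonferroni_reject(p_values, alpha=0.05):
    """Reject hypothesis i when p_i <= alpha / m (each decision uses only its own p-value)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure: the k-th smallest p-value (k = 0, 1, ...)
    is compared with alpha / (m - k); stop at the first non-rejection."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for k, i in enumerate(order):
        if p_values[i] <= alpha / (m - k):
            reject[i] = True
        else:
            break  # all larger p-values are not rejected either
    return reject


def bonferroni_ci(estimate, std_error, m, alpha=0.05):
    """Two-sided Bonferroni-adjusted interval under a normal approximation
    (an assumption of this sketch): an ordinary (1 - alpha / m) interval."""
    z = NormalDist().inv_cdf(1 - alpha / (2 * m))
    return estimate - z * std_error, estimate + z * std_error


p_values = [0.010, 0.020, 0.030]  # made-up p-values for three metrics
print(bonferroni_reject(p_values))  # [True, False, False]
print(holm_reject(p_values))        # [True, True, True]
print(bonferroni_ci(estimate=0.8, std_error=0.3, m=len(p_values)))
```

In this made-up example Holm rejects more hypotheses, but the thresholds it applies (0.05/3, 0.05/2, 0.05) shift with the ordering of the p-values, which is the part that is harder to explain next to a per-metric confidence interval.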


3.1 The composite hypotheses of superiority and non-inferiority tests




DEFINITION 3.1 (At-least-one testing for superiority). A composite hypothesis for superiority that is rejected if at least one of the S subhypotheses is rejected is described by





Definition 3.1 says that we reject the global null hypothesis if any of the subhypotheses is rejected. This is the decision rule for which most multiple-testing corrections, including the Bonferroni and Šidák corrections, bound the type I error rate. In other words, if any metric improves significantly, we ship the product change.
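In standard union-intersection notation, such a composite hypothesis and the Bonferroni bound on its type I error rate can be sketched as follows. The symbols $H_{0,s}$ and $H_{1,s}$ for the null and alternative of the $s$-th success metric, and $\alpha$ for the intended overall level, are assumptions of this sketch rather than notation taken from the text.

```latex
H_0 = \bigcap_{s=1}^{S} H_{0,s}
\quad \text{versus} \quad
H_1 = \bigcup_{s=1}^{S} H_{1,s},
\qquad
\Pr\Big( \bigcup_{s=1}^{S} \{\text{reject } H_{0,s}\} \,\Big|\, H_0 \Big)
\;\le\; \sum_{s=1}^{S} \Pr\big( \text{reject } H_{0,s} \mid H_{0,s} \big)
\;\le\; S \cdot \frac{\alpha}{S} \;=\; \alpha .
```

The first inequality is the union bound, and the second holds when each of the $S$ subhypotheses is tested at level $\alpha / S$.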


DEFINITION 3.2 (All-or-none testing for non-inferiority). A composite hypothesis for non-inferiority that is rejected if all the G subhypotheses are rejected is described by





Definition 3.2 says that we will reject the global null hypothesis only if all subhypotheses are rejected. This kind of global hypothesis, although it is arguably natural for non-inferiority tests, has not received much attention in the online experimentation literature.
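In standard intersection-union notation, this kind of composite hypothesis can be sketched as below. The symbols $H_{0,g}$ and $H_{1,g}$ for the null and alternative of the $g$-th guardrail metric, and $\alpha$ for the level of each individual test, are assumptions of this sketch. Under the global null at least one subhypothesis, say $H_{0,g^\ast}$, is true, and rejecting everything requires rejecting that one as well.

```latex
H_0 = \bigcup_{g=1}^{G} H_{0,g}
\quad \text{versus} \quad
H_1 = \bigcap_{g=1}^{G} H_{1,g},
\qquad
\Pr\Big( \bigcap_{g=1}^{G} \{\text{reject } H_{0,g}\} \,\Big|\, H_{0,g^\ast} \Big)
\;\le\; \Pr\big( \text{reject } H_{0,g^\ast} \mid H_{0,g^\ast} \big)
\;\le\; \alpha .
```

Read this way, the all-or-none rule keeps the type I error rate at or below $\alpha$ even when every subhypothesis is tested at the unadjusted level $\alpha$. The price of requiring all tests to reject is paid in power instead.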


The two testing procedures outlined in Definitions 3.1 and 3.2 are both well-studied in statistics, where they are known as union-intersection (UI) and intersection-union (IU) testing, respectively [3].
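The two rules can be sketched as a pair of small Python functions. This is an illustrative sketch, not code from the paper: the function names and p-values are hypothetical, the UI rule uses a Bonferroni split of alpha across the S tests, and the IU rule tests each subhypothesis at the unadjusted level alpha, which is the standard level for intersection-union tests.

```python
def ui_reject(p_values, alpha=0.05):
    """At-least-one (union-intersection) rule: reject the global null if ANY
    sub-null is rejected at the Bonferroni level alpha / S."""
    if not p_values:
        raise ValueError("need at least one p-value")
    s = len(p_values)
    return any(p <= alpha / s for p in p_values)


def iu_reject(p_values, alpha=0.05):
    """All-or-none (intersection-union) rule: reject the global null only if
    EVERY sub-null is rejected, each at the unadjusted level alpha
    (the standard IU level; an assumption of this sketch)."""
    if not p_values:
        raise ValueError("need at least one p-value")
    return all(p <= alpha for p in p_values)


# Hypothetical one-sided p-values for illustration only.
success_p = [0.30, 0.012, 0.60]    # superiority tests on success metrics
guardrail_p = [0.01, 0.04, 0.20]   # non-inferiority tests on guardrail metrics

print(ui_reject(success_p))    # True: 0.012 <= 0.05 / 3
print(iu_reject(guardrail_p))  # False: 0.20 > 0.05, so one guardrail is not cleared
```

The asymmetry is visible in the thresholds: the UI rule divides alpha because any single rejection triggers the decision, while the IU rule leaves alpha undivided because a single failure to reject blocks it.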




Authors:

(1) Mårten Schultzberg, Experimentation Platform team, Spotify, Stockholm, Sweden;

(2) Sebastian Ankargren, Experimentation Platform team, Spotify, Stockholm, Sweden;

(3) Mattias Frånberg, Experimentation Platform team, Spotify, Stockholm, Sweden.


This paper is available on arXiv under the CC BY 4.0 DEED license.