Table of Links
- Types of Metrics and Their Hypotheses and 2.1 Types of Metrics
- Type I and Type II Error Rates for Decision Rules Including Superiority and Non-Inferiority Tests
  - 3.1 The composite hypotheses of superiority and non-inferiority tests
  - 3.2 Bounding the type I and type II error rates for UI and IU testing
  - 3.3 Bounding the error rates for a decision rule including both success and guardrail metrics
- Extending the Decision Rule with Deterioration and Quality Metrics
- Appendix A: Improving the Efficiency of Proposition 4.1 with Additional Assumptions
- Appendix B: Examples of Global False and True Positive Rates
- Appendix C: A Note on Sequential Testing of Deterioration
- Appendix D: Using Nyholt's Method of Efficient Number of Independent Tests
- Acknowledgments and References
3. TYPE I AND TYPE II ERROR RATES FOR DECISION RULES INCLUDING SUPERIORITY AND NON-INFERIORITY TESTS
The categorization of metrics into success, guardrail, deterioration, and quality metrics naturally paves the way for a decision rule that combines the results for metrics in each category appropriately. The decision rule, in turn, implies various multiple-testing corrections for the statistical inference, so that both the type I and type II error rates of the shipping decision are controlled as intended. In this section, we start by establishing some fundamental results for superiority and non-inferiority testing and, more generally, for the union-intersection and intersection-union testing that these rely on. These results are then used to construct a decision rule that includes superiority and non-inferiority tests. In subsequent sections, we add deterioration tests and, finally, quality metrics.
The goal of the designs in this paper is to bound the family-wise error rates of the decision. In the following, we focus our discussion on Bonferroni-based adjustments for multiple comparisons that let us use results of individual tests to evaluate the joint global hypothesis. There are two main contributing factors to why we choose this approach:
- Interpretability is greatly simplified when experimenters can view individual results for each metric (including confidence intervals).
- By evaluating a global hypothesis, composed of individual hypotheses for multiple metrics, through individual tests, we can fit our framework into a large, scalable experimentation platform, where a decision-rule approach integrates seamlessly with other core experimentation functionality such as sequential testing and variance reduction.
While other multiple-testing adjustment procedures are often more powerful, they generally achieve that power at the expense of interpretability and ease of understanding. For example, Holm's correction [7] is generally more powerful than Bonferroni's, but the associated confidence intervals are more complicated and do not always yield finite bounds [6]. Even when confidence intervals are available, explaining the correction and how it affects the intervals to experimenters is nontrivial.
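To make the power-versus-interpretability trade-off concrete, the following sketch (not from the paper; the p-values are illustrative) computes adjusted p-values under both corrections. Holm's adjusted p-values are never larger than Bonferroni's, so Holm rejects at least as often, but its stepwise ordering is what complicates the corresponding confidence intervals.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values: multiply each by the number of tests."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def holm(pvals):
    """Holm step-down adjusted p-values: scale the k-th smallest p-value
    by (m - k + 1) and enforce monotonicity over the sorted order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

pvals = [0.012, 0.034, 0.041]
print(bonferroni(pvals))  # Holm's values are pointwise <= Bonferroni's
print(holm(pvals))
```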
3.1 The composite hypotheses of superiority and non-inferiority tests
DEFINITION 3.1 (At-least-one testing for superiority). A composite hypothesis for superiority that is rejected if at least one of the S subhypotheses is rejected is described by

H0: τ_s ≤ 0 for all s ∈ {1, …, S} versus H1: τ_s > 0 for at least one s ∈ {1, …, S},

where τ_s denotes the treatment effect on the s-th success metric.
Definition 3.1 says that we reject the global null hypothesis if any of the subhypotheses is rejected. This is the decision rule for which most multiple-testing corrections, including the Bonferroni and Šidák corrections, bound the type I error rate. In other words, if any metric improves significantly, we ship the product change.
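As a minimal sketch (our own illustration, not the paper's implementation), the at-least-one rule of Definition 3.1 with a Bonferroni correction across S success metrics can be written as follows; the one-sided p-values and α are hypothetical:

```python
def ui_ship_decision(p_success, alpha=0.05):
    """Union-intersection test: reject the global null (i.e., ship) if at
    least one superiority subhypothesis rejects at the Bonferroni-corrected
    level alpha / S."""
    s = len(p_success)
    return any(p < alpha / s for p in p_success)

# One of three success metrics is significant after correction: 0.01 < 0.05/3.
print(ui_ship_decision([0.20, 0.01, 0.30]))  # True
```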
DEFINITION 3.2 (All-or-none testing for non-inferiority). A composite hypothesis for non-inferiority that is rejected only if all of the G subhypotheses are rejected is described by

H0: τ_g ≤ −δ_g for at least one g ∈ {1, …, G} versus H1: τ_g > −δ_g for all g ∈ {1, …, G},

where τ_g denotes the treatment effect on the g-th guardrail metric and δ_g > 0 is its non-inferiority margin.
Definition 3.2 says that we will reject the global null hypothesis only if all subhypotheses are rejected. This kind of global hypothesis, although it is arguably natural for non-inferiority tests, has not received much attention in the online experimentation literature.
The two testing procedures outlined in Definitions 3.1 and 3.2 are both well studied in statistics, where they are known as union-intersection (UI) and intersection-union (IU) testing, respectively [3].
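The all-or-none rule of Definition 3.2 can be sketched in the same style (again our own illustration with hypothetical p-values). A standard property of IU tests is that running each subtest at the unadjusted level α keeps the overall test at level α, so no multiplicity correction is applied here:

```python
def iu_ship_decision(p_guardrail, alpha=0.05):
    """Intersection-union test: reject the global null only if every
    non-inferiority subhypothesis rejects. Each subtest uses the
    unadjusted level alpha."""
    return all(p < alpha for p in p_guardrail)

# All guardrails clear their non-inferiority tests versus one failing.
print(iu_ship_decision([0.01, 0.03, 0.04]))  # True
print(iu_ship_decision([0.01, 0.08]))        # False
```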
Authors:
(1) Mårten Schultzberg, Experimentation Platform team, Spotify, Stockholm, Sweden;
(2) Sebastian Ankargren, Experimentation Platform team, Spotify, Stockholm, Sweden;
(3) Mattias Frånberg, Experimentation Platform team, Spotify, Stockholm, Sweden.