Table of Links

Abstract and 1 Introduction
1.1 Related literature
2 Types of Metrics and Their Hypotheses
2.1 Types of metrics
2.2 Hypotheses for different types of metrics
3 Type I and Type II Error Rates for Decision Rules including Superiority and Non-Inferiority Tests
3.1 The composite hypotheses of superiority and non-inferiority tests
3.2 Bounding the type I and type II error rates for UI and IU testing
3.3 Bounding the error rates for a decision rule including both success and guardrail metrics
3.4 Power corrections for non-inferiority testing
4 Extending the Decision Rule with Deterioration and Quality Metrics
5 Monte Carlo Simulation Study
5.1 Results
6 Discussion and Conclusions
APPENDIX A: IMPROVING THE EFFICIENCY OF PROPOSITION 4.1 WITH ADDITIONAL ASSUMPTIONS
APPENDIX B: EXAMPLES OF GLOBAL FALSE AND TRUE POSITIVE RATES
APPENDIX C: A NOTE ON SEQUENTIAL TESTING OF DETERIORATION
APPENDIX D: USING NYHOLT’S METHOD OF EFFICIENT NUMBER OF INDEPENDENT TESTS
Acknowledgments and References

4. EXTENDING THE DECISION RULE WITH DETERIORATION AND QUALITY METRICS

Deterioration tests are inferiority tests that aim to capture regressions in metrics. They can be applied to metrics already present as a success or a guardrail metric, or to metrics that are included only as deterioration metrics. The test checks whether the metric is significantly deteriorating, that is, whether the treatment group is significantly inferior to the control group with respect to the metric of interest. Deterioration tests for success and guardrail metrics attempt to identify significant regressions, which would, if they exist, speak against the success of the experiment.
Neither the superiority test used for success metrics nor the non-inferiority test used for guardrail metrics would on its own indicate a regression. Knowing when regressions occur is essential when running experiments. In practice, in addition to success, guardrail, and deterioration metrics, experiment-quality metrics are also used. For example, at Spotify, we include a set of tests and metrics that evaluate the quality of the experiment. These include a test for balanced traffic through a sample ratio mismatch test, and a test for pre-exposure bias. By including deterioration and quality tests in the decision rule, the complexity of managing the risks of an incorrect decision increases. Decision Rule 2 formalizes the complete decision rule that is used at Spotify.

DECISION RULE 2. Ship the change if and only if:

at least one success metric is significantly superior,
all guardrail metrics are significantly non-inferior,
none of the success, guardrail, or deterioration metrics are significantly inferior, and
none of the quality tests significantly reject quality.

That is, ship if and only if the global superiority/non-inferiority null hypothesis is rejected in favor of the alternative hypothesis, the inferiority null hypothesis is not rejected for any metric, and no quality test is significant. Proposition 4.1 displays the corresponding correction that bounds the error rates of Decision Rule 2.
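Decision Rule 2 can be expressed as a simple predicate over per-test significance flags. The sketch below is illustrative only: the function name and argument structure are ours, not the paper's.

```python
def ship(success_superior, guardrails_noninferior,
         any_metric_inferior, any_quality_rejected):
    """Decision Rule 2: ship the change if and only if
    (1) at least one success metric is significantly superior,
    (2) all guardrail metrics are significantly non-inferior,
    (3) no success, guardrail, or deterioration metric is
        significantly inferior, and
    (4) no quality test significantly rejects quality."""
    return (any(success_superior)            # condition (1)
            and all(guardrails_noninferior)  # condition (2)
            and not any_metric_inferior      # condition (3)
            and not any_quality_rejected)    # condition (4)
```

For example, `ship([True, False], [True, True], False, False)` returns `True`, while flipping any single condition, such as a significant deterioration, makes the rule return `False`.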
This directly implies that the false positive rate of the decision rule is bounded by α. In other words, the deterioration tests can only make the false positive rate lower than α. This concludes the first part of the proof. Plugging this into the above yields the type II error rate of the decision rule. To find the corrected β that achieves the intended type II error rate, we solve the resulting expression for β. The condition α_ < β should be interpreted as follows: it is only possible to correct the false negative rate (and thus ensure the intended power) for decision rules that include deterioration and/or quality tests as long as the intended false negative rate for the decision is larger than the intended false positive rate for those tests. This is quite natural: if we allow a large enough chance of rejecting no deterioration or no quality issue, even when there is no deterioration or problem with the quality, this will at some point (for some α_) limit our ability to reach a positive decision, regardless of the sample size.

For success and guardrail metrics, there is a dependency between the rejection of the deterioration test and the superiority or non-inferiority test for any given metric. That is, if the superiority or non-inferiority test is rejected, this affects the probability that the deterioration test is rejected too. It is possible to utilize this dependency to improve the efficiency of Proposition 4.1 slightly by making additional assumptions about the relation between α, α_, and β. See Appendix A for details.

Authors:

(1) Mårten Schultzberg, Experimentation Platform team, Spotify, Stockholm, Sweden;

(2) Sebastian Ankargren, Experimentation Platform team, Spotify, Stockholm, Sweden;

(3) Mattias Frånberg, Experimentation Platform team, Spotify, Stockholm, Sweden.
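The first part of the proof, that adding a deterioration test can only lower the false positive rate of the decision rule, can be illustrated with a short Monte Carlo sketch. This toy setup is ours, not the paper's simulation study: it assumes a single success metric and a single independent deterioration test, both simulated as one-sided z-tests under the global null of no effects.

```python
import random
from statistics import NormalDist

def ship_rates(alpha=0.05, alpha_det=0.001, n_sims=200_000, seed=1):
    """Under the global null (no true effects), compare the ship rate
    of a superiority-only rule with the same rule plus one independent
    deterioration test, each a one-sided z-test."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha)          # superiority cutoff
    crit_det = NormalDist().inv_cdf(1 - alpha_det)  # deterioration cutoff
    plain = with_det = 0
    for _ in range(n_sims):
        superior = rng.gauss(0, 1) > crit           # success test rejects
        deteriorated = rng.gauss(0, 1) < -crit_det  # deterioration rejects
        plain += superior
        with_det += superior and not deteriorated
    return plain / n_sims, with_det / n_sims
```

Because the deterioration test only adds a conjunct to the ship condition, the second rate can never exceed the first, and both stay at or below α (up to simulation noise).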
This paper is available on arxiv under CC BY 4.0 DEED license.