Table of Links
- Types of Metrics and Their Hypotheses, and 2.1 Types of Metrics
- Type I and Type II Error Rates for Decision Rules Including Superiority and Non-Inferiority Tests
  - 3.1 The composite hypotheses of superiority and non-inferiority tests
  - 3.2 Bounding the type I and type II error rates for UI and IU testing
  - 3.3 Bounding the error rates for a decision rule including both success and guardrail metrics
- Extending the Decision Rule with Deterioration and Quality Metrics
- Appendix A: Improving the Efficiency of Proposition 4.1 with Additional Assumptions
- Appendix B: Examples of Global False and True Positive Rates
- Appendix C: A Note on Sequential Testing of Deterioration
- Appendix D: Using Nyholt's Method of Effective Number of Independent Tests
- Acknowledgments and References
5. MONTE CARLO SIMULATION STUDY
In this section, we run a simulation study to illustrate the empirical error rates of the multi-metric decision rules with and without the alpha and power corrections[1]. To make the simulation more relevant, all deterioration and quality tests use Group Sequential Tests (GSTs), while all non-inferiority and superiority tests use fixed-horizon z-tests. See Appendix C for a discussion of combining sequential and fixed-horizon tests for the same metric. For the GSTs, we analyze the results 10 times during data collection, at evenly spaced intervals. We generate data from a multivariate normal distribution and treat the variance as known in the tests so that the sample size can be kept small. We use S = G = 5 and D = Q = 2, and repeat the simulation 100,000 times for each setting. In all scenarios, α+ = α− = 0.05 and β = 0.2. We compare three designs: no correction, only alpha correction, and Proposition 4.1 with Remark A.1. In the simulation study, we vary the following:
• Hypothesis under which the simulation is performed
– H0: the null of the non-inferiority and superiority tests
– Status quo: the null of the superiority and the alternative for the non-inferiority tests
– H1: the alternative for non-inferiority and superiority tests
• Dependence structure
– Independent: all metrics independent
– Dependent: all pairwise correlations 0.99
– Block 1: all guardrail metrics independent of each other and the success metrics, but all success metrics have pairwise correlation 0.99
– Block 2: all success metrics independent of each other and the guardrail metrics, but all guardrail metrics have pairwise correlation 0.99
For all settings, all additional deterioration and quality metrics are generated as independent of each other and all other metrics with a zero effect.
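The four dependence structures above can be encoded as covariance matrices for the S + G = 10 success and guardrail metrics. Below is a minimal sketch in Python; the function and variable names are ours for illustration and are not taken from the paper's replication code.

```python
import numpy as np

def make_cov(setting, n_success=5, n_guardrail=5, rho=0.99):
    """Covariance matrix for the success + guardrail metrics under the
    four dependence settings: independent, dependent, block1, block2."""
    m = n_success + n_guardrail
    cov = np.eye(m)
    s = slice(0, n_success)            # success-metric block
    g = slice(n_success, m)            # guardrail-metric block
    if setting == "dependent":         # all pairwise correlations rho
        cov[:] = rho
    elif setting == "block1":          # only success metrics correlated
        cov[s, s] = rho
    elif setting == "block2":          # only guardrail metrics correlated
        cov[g, g] = rho
    # "independent": identity matrix, nothing to change
    np.fill_diagonal(cov, 1.0)
    return cov

rng = np.random.default_rng(2024)
cov = make_cov("block1")
# One simulated experiment: 1000 units, 10 metrics, zero effect.
z = rng.multivariate_normal(mean=np.zeros(10), cov=cov, size=1000)
```

Each Monte Carlo repetition would draw such a sample (with the appropriate mean vector for the H0, status quo, or H1 scenario) and apply the z-tests and GSTs to it.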
5.1 Results
For convenience, the results are split by scenario: H0, Status quo, and H1. We present the rejection rates for the following groups of tests:
We also present the rejection rates for the three decision rules developed in the paper; note, however, that the design is always based on Proposition 4.1.
5.1.1 Results under the global H0
Table 3 displays the results under the null hypotheses of the non-inferiority and superiority tests. As expected, all three decision rules are conservative in all settings. This is explained by two things. First, the power of the deterioration tests on the guardrail metrics is very high under the null hypothesis of the non-inferiority tests. Second, the probability of rejecting all guardrail metrics simultaneously is very low, except when all guardrail metrics are strongly correlated, which is confirmed by the Dependent and Block 1 covariance settings. As stated before, the H0 scenario is arguably of little practical relevance, because the null hypothesis of the non-inferiority test is chosen by the experimenter. A more relevant scenario, in our opinion, is the status quo scenario, whose results are displayed in the next section.
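A back-of-the-envelope calculation (illustrative arithmetic, not the paper's replication code) shows why simultaneous rejection of all guardrail tests is rare under independence but not under strong correlation:

```python
# Probability that all G = 5 guardrail tests reject simultaneously
# when each one-sided test has level alpha = 0.05 under its null.
G, alpha = 5, 0.05
p_independent = alpha ** G  # tests reject independently: ~3e-07
# Under (near-)perfect positive correlation the test statistics move
# together, so the simultaneous rejection probability approaches alpha.
```

This is why the rejection rate is non-negligible only in the Dependent and Block 1 settings, where the guardrail test statistics are strongly correlated.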
5.1.2 Results under status quo
Table 4 displays the results under the alternative hypothesis of the non-inferiority tests and the null hypothesis of the superiority tests. The results clearly show that Proposition 4.1 is the correction with the highest type I error rate, while still not exceeding the intended α = 0.05. The Only Alpha correction bounds the type I error rate, but is conservative due to the lack of simultaneous power for the guardrail metrics. As expected, adding the deterioration and quality tests makes the rejection rate more conservative, but by a magnitude of little practical relevance, especially compared to, e.g., the impact of including beta corrections in the analysis.
5.1.3 Results under the global H1
Table 5 displays the results under the alternative hypotheses of the non-inferiority and superiority tests. In this setting, the need for the power correction imposed by Proposition 4.1 is clear. For the other two corrections, the rate at which the guardrail metrics are simultaneously significantly non-inferior (RG) does not reach the intended power of 80%, which implies that the decision rules cannot reach it either. In the settings where the guardrail metrics are all independent (Independent and Block 1), the rejection rates under no correction and the Only Alpha correction are as low as 30% for both decision rules, less than half of the intended power.
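The roughly 30% figure for independent guardrail metrics follows directly from the joint power of independent tests (illustrative arithmetic):

```python
# With G independent guardrail tests, each powered at 1 - beta = 80%,
# the probability that all are simultaneously significant is (1 - beta)^G.
G, beta = 5, 0.2
joint_power = (1 - beta) ** G
print(round(joint_power, 3))  # 0.328, i.e. less than half of 80%
```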
The results under H1 show that Proposition 4.1 bounds the error rates even in the worst-case scenarios, which leads to higher power than intended in the best-case scenarios. For example, Proposition 4.1 ensures that even under the Block 2 covariance matrix, where all guardrail metrics are independent and all success metrics are dependent, the power is above 80%. However, to guarantee that level of power in the worst-case scenario, the power instead exceeds 94% in the best-case scenario.
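The overshoot in the best case can be illustrated with a sketch of the underlying arithmetic (this is an independence-based illustration, not Proposition 4.1 itself): under independence, each test must be powered at (1 − β)^(1/G) for the joint power to reach 1 − β, and under strong positive correlation the joint power then approaches that higher marginal level.

```python
G, beta = 5, 0.2
# Per-test power so that G independent tests are jointly powered at 1 - beta:
per_test_power = (1 - beta) ** (1 / G)
print(round(per_test_power, 3))       # 0.956
# Check: joint power under independence recovers the 80% target.
print(round(per_test_power ** G, 3))  # 0.8
# Under strong positive correlation the tests tend to succeed together,
# so joint power approaches the marginal 0.956 -- well above the target.
```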
Authors:
(1) Mårten Schultzberg, Experimentation Platform team, Spotify, Stockholm, Sweden;
(2) Sebastian Ankargren, Experimentation Platform team, Spotify, Stockholm, Sweden;
(3) Mattias Frånberg, Experimentation Platform team, Spotify, Stockholm, Sweden.
[1] Code for replication can be found at https://github.com/MSchultzberg/Risk-management-paper-2024.