Table of Links
- Types of Metrics and Their Hypotheses and 2.1 Types of metrics
- Type I and Type II Error Rates for Decision Rules including Superiority and Non-Inferiority Tests
  - 3.1 The composite hypotheses of superiority and non-inferiority tests
  - 3.2 Bounding the type I and type II error rates for UI and IU testing
  - 3.3 Bounding the error rates for a decision rule including both success and guardrail metrics
- Extending the Decision Rule with Deterioration and Quality Metrics
- APPENDIX A: IMPROVING THE EFFICIENCY OF PROPOSITION 4.1 WITH ADDITIONAL ASSUMPTIONS
- APPENDIX B: EXAMPLES OF GLOBAL FALSE AND TRUE POSITIVE RATES
- APPENDIX C: A NOTE ON SEQUENTIAL TESTING OF DETERIORATION
- APPENDIX D: USING NYHOLT’S METHOD OF EFFICIENT NUMBER OF INDEPENDENT TESTS
- Acknowledgments and References
2. TYPES OF METRICS AND THEIR HYPOTHESES
Throughout the paper, we assume that the aim of the statistical analysis of an experiment is to make a binary decision regarding the success of the treatment. This decision can be, for example, whether or not to go ahead with the next step of validation for a medical treatment, or whether or not to ship a product change. Our goal is to bound the type I and type II error rates for this decision under repeated experimentation. We focus on the product-decision example, but the results are applicable in a wider, more general setting.
2.1 Types of metrics
In modern online experimentation, experiments are evaluated using multiple metrics. Based on the results for each metric, typically estimates of the average treatment effects, the experimenters decide whether to ship the feature more widely. The heuristics underlying this decision-making process are seldom transparent, and in large organizations the process often varies substantially from team to team. At Spotify, we have introduced a standardized way of providing a recommendation for a suggested course of action given the outcomes for a set of metrics that belong to different categories. We call this recommendation a shipping recommendation. These recommendations are powered by a decision rule that includes four types of metrics and their associated hypotheses and tests:
- Success metrics. Metrics that we aim to improve, tested with superiority tests.
- Guardrail metrics. Metrics that we do not want to see deteriorate more than a certain threshold, tested with non-inferiority tests.
- Deterioration metrics. Metrics that should not deteriorate, tested with inferiority tests.
- Quality metrics. Metrics that verify the integrity and validity of the experiment itself.
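To make the interplay of the four categories concrete, the sketch below shows one plausible shape of such a decision rule. The function name, return labels, and the particular way the categories are combined are illustrative assumptions; the paper's actual rule and its error-rate guarantees are developed in the following sections.

```python
def shipping_recommendation(success_pvals, guardrail_pvals,
                            deterioration_pvals, quality_pvals,
                            alpha=0.05):
    """Illustrative decision rule combining the four metric categories.

    Each argument is a list of p-values from the corresponding one-sided
    tests: superiority (success), non-inferiority (guardrail),
    inferiority (deterioration), and experiment-quality checks such as
    a sample ratio mismatch test.
    """
    # Any significant quality check invalidates the experiment itself.
    if any(p < alpha for p in quality_pvals):
        return "invalid"
    # Any significant deterioration blocks shipping.
    if any(p < alpha for p in deterioration_pvals):
        return "abandon"
    # Ship only if every success metric is significantly superior and
    # every guardrail is significantly non-inferior.
    if all(p < alpha for p in success_pvals) and \
       all(p < alpha for p in guardrail_pvals):
        return "ship"
    return "inconclusive"
```

Because the rule intersects several tests, its overall type I and type II error rates are not simply the per-test ones, which is precisely the question Section 3 addresses.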
A metric can belong to multiple categories. For example, at Spotify, all success and guardrail metrics are also deterioration metrics. We elaborate on the details in later sections, but the implication is that we monitor all metrics for regressions, even when our goal and hypothesis is that they improve. Quality metrics are not always metrics in the traditional sense. For example, a crucial experiment-quality check is the sample ratio mismatch test [4], which is a goodness-of-fit test of proportions rather than a metric. Table 1 shows an example of a set of metrics used in an experiment. In this case, the experiment attempts to increase the minutes played of music, but includes podcast minutes played to verify that music consumption does not increase at the expense of podcast consumption. Both metrics are also included as deterioration metrics to ensure that they do not move in the direction opposite to what we expect and hope for. Additionally, the share of users that experience a crash is included as a deterioration metric. The only quality metric in this case is a sample ratio mismatch metric.
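The sample ratio mismatch check mentioned above can be sketched as a two-sided test of whether the observed treatment share departs from its expected value. The sketch below uses a normal approximation to the binomial (for two groups this is equivalent to a chi-square goodness-of-fit test); the function name and default 50/50 split are illustrative assumptions.

```python
from math import sqrt, erf

def srm_pvalue(n_control, n_treatment, expected_ratio=0.5):
    """Two-sided p-value for sample ratio mismatch: does the observed
    treatment share differ from the configured allocation ratio?"""
    n = n_control + n_treatment
    observed = n_treatment / n
    # Standard error of the treatment share under the intended allocation.
    se = sqrt(expected_ratio * (1 - expected_ratio) / n)
    z = (observed - expected_ratio) / se
    # Two-sided tail probability via the standard normal CDF.
    phi = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return 2 * (1 - phi)
```

For a configured 50/50 split, counts of 4800 versus 5200 give a tiny p-value, flagging the experiment as invalid before any treatment-effect metric is interpreted.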
In the following, we use "inferiority test" interchangeably with "deterioration test": given our standardization that a decrease is a regression, it is a test for the deterioration of a metric. With some abuse of language, we sometimes say "metric x was significantly superior" to mean that "the treatment was significantly superior to control with respect to metric x".
2.2 Hypotheses for different types of metrics
The different categories of metrics serve different purposes, which by extension means that their associated statistical hypotheses differ. Table 2 displays the hypotheses for the categories of metrics used in the decision rules considered in this paper. While the hypotheses are similar, and to some degree even opposites of one another, they give rise to distinct interpretations. For example, the alternative of the non-inferiority test for which we design the experiment to be powered is the null hypothesis of the superiority test, which means that under the non-inferiority null hypothesis a guardrail metric has deteriorated by the NIM (non-inferiority margin). We also consider a third, hypothesis-like scenario, the "status quo", in which no metric has moved, to facilitate our discussion. The status quo scenario is thus under the alternative for the non-inferiority tests and under the null hypothesis for the superiority and deterioration tests.
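Assuming the standard setup where a positive treatment effect Δ is an improvement, the three one-sided tests can be written as z-tests on an effect estimate and its standard error. The sketch below is a minimal illustration of the hypotheses in Table 2, not the paper's implementation; function names are ours.

```python
from math import sqrt, erf

def norm_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def superiority_pvalue(delta_hat, se):
    # Success metric: H0: delta <= 0 vs H1: delta > 0.
    return 1.0 - norm_cdf(delta_hat / se)

def noninferiority_pvalue(delta_hat, se, nim):
    # Guardrail metric: H0: delta <= -NIM vs H1: delta > -NIM.
    return 1.0 - norm_cdf((delta_hat + nim) / se)

def inferiority_pvalue(delta_hat, se):
    # Deterioration metric: H0: delta >= 0 vs H1: delta < 0.
    return norm_cdf(delta_hat / se)
```

The status quo scenario corresponds to delta_hat centered at 0: there the superiority and inferiority tests sit on their null boundaries (p-values near 0.5), while the same point lies inside the non-inferiority alternative, so its p-value is smaller, mirroring the asymmetry discussed above.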
Authors:
(1) Mårten Schultzberg, Experimentation Platform team, Spotify, Stockholm, Sweden;
(2) Sebastian Ankargren, Experimentation Platform team, Spotify, Stockholm, Sweden;
(3) Mattias Frånberg, Experimentation Platform team, Spotify, Stockholm, Sweden.