How Spotify Standardizes Multi-Metric Experiment Analysis

by AB Test, March 30th, 2025

Too Long; Didn't Read

A/B testing with multiple outcomes requires structured decision-making. Spotify draws insights from decision theory, OECs, and clinical trials to refine its approach, ensuring reliable, scalable experimentation.


Abstract and 1 Introduction

1.1 Related literature

  2. Types of Metrics and Their Hypotheses and 2.1 Types of metrics

    2.2 Hypotheses for different types of metrics

  3. Type I and Type II Error Rates for Decision Rules including Superiority and Non-Inferiority Tests

    3.1 The composite hypotheses of superiority and non-inferiority tests

    3.2 Bounding the type I and type II error rates for UI and IU testing

    3.3 Bounding the error rates for a decision rule including both success and guardrail metrics

    3.4 Power corrections for non-inferiority testing

  4. Extending the Decision Rule with Deterioration and Quality Metrics

  5. Monte Carlo Simulation Study

    5.1 Results

  6. Discussion and Conclusions


APPENDIX A: IMPROVING THE EFFICIENCY OF PROPOSITION 4.1 WITH ADDITIONAL ASSUMPTIONS

APPENDIX B: EXAMPLES OF GLOBAL FALSE AND TRUE POSITIVE RATES

APPENDIX C: A NOTE ON SEQUENTIAL TESTING OF DETERIORATION

APPENDIX D: USING NYHOLT’S METHOD OF EFFICIENT NUMBER OF INDEPENDENT TESTS


Acknowledgments and References

Decision theory is a mathematical framework for formalizing decision problems under uncertainty; see e.g. [1] for an introduction. Although the theory is comprehensive and flexible, it is non-trivial for most people, and it moves the decision problem far from the hypothesis-testing realm that is familiar to most experimenters. Since modern tech companies often have many teams experimenting independently, it is not plausible to make decision-theory experts available to all of them.
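
To give a flavor of what the decision-theoretic framing looks like, here is a minimal sketch of a ship/no-ship choice made by maximizing expected utility. All probabilities and utilities are hypothetical illustrations, not values from the paper, and a real treatment would require eliciting these quantities carefully for every experiment, which is exactly the expertise bottleneck noted above.

```python
# Minimal expected-utility sketch of a ship/no-ship decision.
# All numbers are hypothetical illustrations, not Spotify's values.

# Posterior beliefs about the treatment effect after the experiment.
p_effect = {"improves": 0.6, "no_change": 0.3, "harms": 0.1}

# Utility of each (action, state) pair, in arbitrary units.
utility = {
    ("ship", "improves"): 10.0,
    ("ship", "no_change"): -1.0,   # maintenance cost of a useless feature
    ("ship", "harms"): -20.0,
    ("no_ship", "improves"): -5.0,  # opportunity cost of not shipping
    ("no_ship", "no_change"): 0.0,
    ("no_ship", "harms"): 0.0,
}

def expected_utility(action: str) -> float:
    """Average the utility of an action over the belief distribution."""
    return sum(p * utility[(action, state)] for state, p in p_effect.items())

best = max(["ship", "no_ship"], key=expected_utility)
print({a: expected_utility(a) for a in ["ship", "no_ship"]}, "->", best)
```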


Another popular alternative for decision making in A/B tests that evaluate several outcomes is the so-called overall evaluation criterion (OEC); see e.g. [12] for a recent introduction. An OEC removes the problem of several possibly contradictory results by using just a single metric. This metric can either be a proxy for all the necessary aspects of the important outcomes, or a function, such as a linear combination, of a selected set of metrics. Designing an appropriate OEC generally requires prolonged research and strong alignment within the organization. For larger companies, a single OEC may not even suffice because the business itself is diverse. Moreover, if the OEC is a complicated function of several outcomes, it can be difficult for experimenters to understand their results. Even when an OEC is used, it rarely includes all quality tests and metrics that should not deteriorate. That is, an OEC is typically a metric that trades off various outcomes to define success, but as we show in this paper, a product decision rule can also explicitly include the efforts to avoid end-user harm and experiment invalidity, which in turn affects the power analysis and design of the experiment.
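
As a concrete illustration of the linear-combination form of an OEC, the sketch below collapses several per-user metrics into one score that can then be analyzed as a single-metric test. The metric names, weights, and data are hypothetical, not an OEC Spotify uses.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000  # users per group

# Hypothetical per-user metrics for control and treatment groups.
metrics = ["minutes_played", "skips", "searches"]
control = {m: rng.normal(loc=mu, scale=1.0, size=n)
           for m, mu in zip(metrics, [30.0, 5.0, 2.0])}
treatment = {m: rng.normal(loc=mu, scale=1.0, size=n)
             for m, mu in zip(metrics, [30.2, 4.9, 2.1])}

# Hypothetical weights encoding the trade-off between outcomes;
# skips are "bad", so they receive a negative weight.
weights = {"minutes_played": 1.0, "skips": -0.5, "searches": 0.2}

def oec(user_metrics: dict[str, np.ndarray]) -> np.ndarray:
    """Per-user OEC: a weighted linear combination of standardized metrics."""
    return sum(weights[m] * (user_metrics[m] - control[m].mean()) / control[m].std()
               for m in metrics)

# The experiment is then analyzed as a single-metric test on the OEC.
delta = oec(treatment).mean() - oec(control).mean()
print(f"OEC difference (treatment - control): {delta:.3f}")
```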


In the clinical trial literature (for an overview, see [3, ch. 4; 16]), so-called multiple endpoint experiments are experiments with more than one outcome (metric). The endpoints can be both efficacy endpoints (success metrics) and safety endpoints (guardrail metrics). In some trials, there are also primary and secondary endpoints, where the secondary endpoints are only evaluated if the primary endpoints are significantly changed. Clearly, this setting closely resembles the online experimentation setting. Similarly to deciding whether a new drug or treatment is safe and effective, deciding whether to ship a new feature is a composite decision that involves a potentially complex interplay between all endpoints. Various experimental design and analysis methods have been applied in the clinical trial setting [13]: hierarchical testing, where primary endpoints are tested before secondary ones [2]; global assessment measures, where the endpoints are first aggregated within each patient and then analyzed with standard statistical methods [15]; and closed testing, where a global null is tested before proceeding to more specific hypotheses [10]. However, at companies like Spotify, there is a strong need to standardize the design, analysis, and decision process of multiple endpoint experiments so that non-statisticians can evaluate product changes in a safe and efficient way.
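
To make this composite decision concrete, below is a minimal sketch of a decision rule in the spirit of the paper: ship only if every success metric passes a one-sided superiority test and every guardrail metric passes a non-inferiority test (an intersection-union style rule). The thresholds, margins, and data are hypothetical, and no multiplicity or power correction is shown.

```python
import numpy as np
from scipy import stats

def z_test(x_t, x_c, shift=0.0):
    """One-sided z-test p-value for H1: mean(treatment) - mean(control) > shift."""
    se = np.sqrt(x_t.var(ddof=1) / len(x_t) + x_c.var(ddof=1) / len(x_c))
    z = (x_t.mean() - x_c.mean() - shift) / se
    return stats.norm.sf(z)

def ship_decision(success, guardrails, alpha=0.05):
    """Intersection-union style rule: all tests must pass to ship.

    success:    list of (treatment, control) pairs, tested for superiority.
    guardrails: list of (treatment, control, margin) triples, tested for
                non-inferiority, i.e. H1: effect > -margin.
    """
    superiority_ok = all(z_test(t, c) < alpha for t, c in success)
    non_inferiority_ok = all(z_test(t, c, shift=-m) < alpha
                             for t, c, m in guardrails)
    return superiority_ok and non_inferiority_ok

rng = np.random.default_rng(1)
n = 5_000
# Hypothetical data: one success metric improves, one guardrail stays flat.
engagement = (rng.normal(1.05, 1, n), rng.normal(1.00, 1, n))
crash_free = (rng.normal(0.99, 1, n), rng.normal(0.99, 1, n), 0.05)
print("ship" if ship_decision([engagement], [crash_free]) else "no ship")
```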


Authors:

(1) Mårten Schultzberg, Experimentation Platform team, Spotify, Stockholm, Sweden;

(2) Sebastian Ankargren, Experimentation Platform team, Spotify, Stockholm, Sweden;

(3) Mattias Frånberg, Experimentation Platform team, Spotify, Stockholm, Sweden.


This paper is available on arXiv under a CC BY 4.0 DEED license.

