
Safe Testing for Large-Scale Experimentation Platforms: Safe t-test on OCE datasets


Table of Links

  1. Introduction

  2. Hypothesis testing

    2.1 Introduction

    2.2 Bayesian statistics

    2.3 Test martingales

    2.4 p-values

    2.5 Optional Stopping and Peeking

    2.6 Combining p-values and Optional Continuation

    2.7 A/B testing

  3. Safe Tests

    3.1 Introduction

    3.2 Classical t-test

    3.3 Safe t-test

    3.4 χ²-test

    3.5 Safe Proportion Test

  4. Safe Testing Simulations

    4.1 Introduction and 4.2 Python Implementation

    4.3 Comparing the t-test with the Safe t-test

    4.4 Comparing the χ²-test with the safe proportion test

  5. Mixture sequential probability ratio test

    5.1 Sequential Testing

    5.2 Mixture SPRT

    5.3 mSPRT and the safe t-test

  6. Online Controlled Experiments

    6.1 Safe t-test on OCE datasets

  7. Vinted A/B tests and 7.1 Safe t-test for Vinted A/B tests

    7.2 Safe proportion test for sample ratio mismatch

  8. Conclusion and References

6.1 Safe t-test on OCE datasets

In order to benchmark the performance of the safe t-test, we compare its results with those of the classical t-test. As we saw in Figure 4 (right), the two tests do not always reach the same conclusion on the same data. Since the t-test is the most widely used statistical test in A/B testing, it is important to contrast the two in order to understand the situations in which their conclusions differ. Table 5 shows the results of the t-test and the safe t-test on the collection of OCE datasets.


Table 5: Decisions of the safe t-test and the classical t-test on the OCE datasets.
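
As an illustration of how a single dataset contributes one cell to a table like Table 5, the minimal sketch below runs both decision rules on one simulated experiment: a fixed-sample t-test on the complete data, and an anytime-valid rule that rejects H0 as soon as a running e-value exceeds 1/α. The e-value is written, as an assumption about its form, in the style of the safe t-test construction of Section 3.3 (a noncentral-t density averaged over a normal prior on the standardised effect size, divided by the central-t density); the simulated data, the prior width g, and the checkpoint grid are illustrative choices, not the paper's implementation.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05

# One simulated experiment: per-index differences between treatment and control
# observations, with a small true lift. All parameters are illustrative.
n_total = 3000
d = 0.08 + rng.normal(0.0, 1.0, n_total)

def t_mixture_evalue(x, g=0.1):
    # E-value based on the one-sample t-statistic of x: noncentral-t density
    # averaged over delta ~ N(0, g), divided by the central-t density. This is
    # a sketch of the construction in Section 3.3, not the paper's exact code.
    n = len(x)
    t_obs = np.sqrt(n) * x.mean() / x.std(ddof=1)
    deltas = np.linspace(-6 * np.sqrt(g), 6 * np.sqrt(g), 2401)
    prior = stats.norm.pdf(deltas, scale=np.sqrt(g))
    numer = np.sum(stats.nct.pdf(t_obs, n - 1, np.sqrt(n) * deltas) * prior)
    # Dividing by prior.sum() normalises the discretised prior to sum to one.
    return numer / (stats.t.pdf(t_obs, n - 1) * prior.sum())

# Fixed-sample decision: a single classical t-test on the full dataset.
_, p_value = stats.ttest_1samp(d, 0.0)

# Anytime-valid decision: reject as soon as the e-value exceeds 1/alpha,
# monitored here on a grid of checkpoints for speed.
checkpoints = range(50, n_total + 1, 50)
e_values = [t_mixture_evalue(d[:n]) for n in checkpoints]
safe_rejects = max(e_values) >= 1.0 / alpha

print(f"classical t-test:  reject H0 = {p_value < alpha} (p = {p_value:.4f})")
print(f"safe-style e-test: reject H0 = {safe_rejects} (max E = {max(e_values):.2f})")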


The safe t-test detects many more effects than the classical t-test. While, in theory, the false positive rate of the safe t-test should remain below α, it seems unlikely that all of these rejections of H0 correspond to true effects. Following an analysis of the behaviour of the E-values over the course of these experiments, we conclude that the high number of H0 rejections is likely due to the novelty effect. As mentioned previously, the novelty effect refers to increased attention to a feature shortly after its release. As a result, the assumption of independent and identically distributed data is violated, and evidence against the null hypothesis accumulates rapidly early in the experiment. For a fixed-sample test this is less of an issue, because the distribution reverts over the course of the experiment and the early period is averaged out. For safe tests, however, it can cause a rejection of H0 before the true impact of the feature has been determined. This is particularly relevant to practitioners seeking to implement anytime-valid statistical testing. Next, in Table 6, we compare the safe t-test and the mSPRT on the OCE datasets.
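
Before turning to that comparison, the novelty-effect mechanism described above can be illustrated with a small simulation. In the sketch below the treatment effect is large for the first few hundred observations (the novelty period) and zero afterwards. The running e-process, here a normal-mixture statistic with the variance treated as known (a simplified stand-in for the safe t-test, and essentially the mSPRT statistic of Section 5.2), typically crosses the 1/α threshold during the novelty period, whereas a fixed-sample t-test on the complete stream is dominated by the long post-novelty phase. All parameters are illustrative.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
tau = 0.1                  # width of the normal mixing density over the mean (assumed)
s2 = 2.0                   # variance of a treatment-control difference, treated as known

# Novelty effect: a strong lift on the first 300 differences, no effect afterwards.
n_novelty, n_total = 300, 20000
effect = np.concatenate([np.full(n_novelty, 0.5), np.zeros(n_total - n_novelty)])
d = effect + rng.normal(0.0, np.sqrt(s2), n_total)

# Running normal-mixture e-process for H0: mean difference = 0.
S = np.cumsum(d)
k = np.arange(1, n_total + 1)
e_process = np.sqrt(s2 / (s2 + k * tau**2)) * np.exp(
    tau**2 * S**2 / (2.0 * s2 * (s2 + k * tau**2))
)
crossings = np.flatnonzero(e_process >= 1.0 / alpha)
if crossings.size:
    print(f"e-process first exceeds 1/alpha at observation {crossings[0] + 1}")
else:
    print("e-process never exceeds 1/alpha")

# Fixed-sample t-test on the full stream: the transient early lift is diluted
# by the much longer period without any effect.
_, p_value = stats.ttest_1samp(d, 0.0)
print(f"fixed-sample t-test on all {n_total} differences: p = {p_value:.3f}")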


Table 6: Decisions of the safe t-test and the mSPRT on the OCE datasets.
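
Tables 5 and 6 are, in effect, cross-tabulations of per-dataset decisions. A minimal sketch of how such a table could be assembled is shown below; the two boolean lists are arbitrary placeholders standing in for per-dataset decisions, not the actual OCE results reported in this section.

import pandas as pd

# Placeholder per-dataset decisions (True = H0 rejected). These values are
# arbitrary illustrations, not the results summarised in Tables 5 and 6.
safe_rejects  = [True, True, False, True, False, True, True, False]
msprt_rejects = [True, False, False, True, False, True, False, False]

table = pd.crosstab(
    pd.Series(safe_rejects, name="safe t-test rejects H0"),
    pd.Series(msprt_rejects, name="mSPRT rejects H0"),
)
print(table)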


Unsurprisingly, given the behaviour observed in Figure 8, the null hypotheses rejected by the mSPRT are also rejected by the safe t-test. However, the safe test rejects even more hypotheses than the mSPRT. This is likely because the safe test is more sensitive than the mSPRT and reacts more strongly to data that contradict the null hypothesis. In the next section, we continue analysing the performance of safe tests at a large-scale tech company, Vinted.


Author:

(1) Daniel Beasley


This paper is available on arxiv under ATTRIBUTION-NONCOMMERCIAL-SHAREALIKE 4.0 INTERNATIONAL license.

