
Safe Testing for Large-Scale Experimentation Platforms: Safe t-test on OCE datasets


Table of Links

  1. Introduction

  2. Hypothesis testing

    2.1 Introduction

    2.2 Bayesian statistics

    2.3 Test martingales

    2.4 p-values

    2.5 Optional Stopping and Peeking

    2.6 Combining p-values and Optional Continuation

    2.7 A/B testing

  3. Safe Tests

    3.1 Introduction

    3.2 Classical t-test

    3.3 Safe t-test

    3.4 χ²-test

    3.5 Safe Proportion Test

  4. Safe Testing Simulations

    4.1 Introduction and 4.2 Python Implementation

    4.3 Comparing the t-test with the Safe t-test

    4.4 Comparing the χ²-test with the safe proportion test

  5. Mixture sequential probability ratio test

    5.1 Sequential Testing

    5.2 Mixture SPRT

    5.3 mSPRT and the safe t-test

  6. Online Controlled Experiments

    6.1 Safe t-test on OCE datasets

  7. Vinted A/B tests and 7.1 Safe t-test for Vinted A/B tests

    7.2 Safe proportion test for sample ratio mismatch

  8. Conclusion and References

6.1 Safe t-test on OCE datasets

In order to benchmark the performance of the safe t-test, we compare its results with those of the classical t-test. As we saw in Figure 4 (right), the two tests do not always reach the same conclusion on the same data. Since the t-test is the most widely used statistical test in A/B testing, it is important to contrast the two in order to understand the situations in which their conclusions differ. Table 5 shows the results of the t-test and the safe t-test on the collection of OCE datasets.


Table 5: Decisions of the safe t-test and the classical t-test on the OCE datasets.
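
As an illustration of how a single dataset contributes one cell to a table like Table 5, the minimal sketch below runs both decision rules on one simulated experiment: a fixed-sample t-test on the complete data, and an anytime-valid rule that rejects H0 as soon as a running e-value exceeds 1/α. The e-value is written, as an assumption about its form, in the style of the safe t-test construction of Section 3.3 (a noncentral-t density averaged over a normal prior on the standardised effect size, divided by the central-t density); the simulated data, the prior width g, and the checkpoint grid are illustrative choices, not the paper's implementation.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05

# One simulated experiment: per-index differences between treatment and control
# observations, with a small true lift. All parameters are illustrative.
n_total = 3000
d = 0.08 + rng.normal(0.0, 1.0, n_total)

def t_mixture_evalue(x, g=0.1):
    # E-value based on the one-sample t-statistic of x: noncentral-t density
    # averaged over delta ~ N(0, g), divided by the central-t density. This is
    # a sketch of the construction in Section 3.3, not the paper's exact code.
    n = len(x)
    t_obs = np.sqrt(n) * x.mean() / x.std(ddof=1)
    deltas = np.linspace(-6 * np.sqrt(g), 6 * np.sqrt(g), 2401)
    prior = stats.norm.pdf(deltas, scale=np.sqrt(g))
    numer = np.sum(stats.nct.pdf(t_obs, n - 1, np.sqrt(n) * deltas) * prior)
    # Dividing by prior.sum() normalises the discretised prior to sum to one.
    return numer / (stats.t.pdf(t_obs, n - 1) * prior.sum())

# Fixed-sample decision: a single classical t-test on the full dataset.
_, p_value = stats.ttest_1samp(d, 0.0)

# Anytime-valid decision: reject as soon as the e-value exceeds 1/alpha,
# monitored here on a grid of checkpoints for speed.
checkpoints = range(50, n_total + 1, 50)
e_values = [t_mixture_evalue(d[:n]) for n in checkpoints]
safe_rejects = max(e_values) >= 1.0 / alpha

print(f"classical t-test:  reject H0 = {p_value < alpha} (p = {p_value:.4f})")
print(f"safe-style e-test: reject H0 = {safe_rejects} (max E = {max(e_values):.2f})")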


The safe t-test detects many more effects than the classical t-test. While, in theory, the false positive rate of the safe t-test should remain below α, it seems unlikely that all of these rejections of H0 correspond to true effects. Following an analysis of the behaviour of the E-values over the course of these experiments, we conclude that the high number of H0 rejections is likely due to the novelty effect. As mentioned previously, the novelty effect refers to increased attention to a feature shortly after its release. As a result, the assumption of independent and identically distributed data is violated, and evidence against the null hypothesis accumulates rapidly early in the experiment. For a fixed-sample test this is less of an issue, because the distribution reverts over the course of the experiment and the early period is averaged out. For safe tests, however, it can cause a rejection of H0 before the true impact of the feature has been determined. This is particularly relevant to practitioners seeking to implement anytime-valid statistical testing. Next, in Table 6, we compare the safe t-test and the mSPRT on the OCE datasets.
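
Before turning to that comparison, the novelty-effect mechanism described above can be illustrated with a small simulation. In the sketch below the treatment effect is large for the first few hundred observations (the novelty period) and zero afterwards. The running e-process, here a normal-mixture statistic with the variance treated as known (a simplified stand-in for the safe t-test, and essentially the mSPRT statistic of Section 5.2), typically crosses the 1/α threshold during the novelty period, whereas a fixed-sample t-test on the complete stream is dominated by the long post-novelty phase. All parameters are illustrative.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
tau = 0.1                  # width of the normal mixing density over the mean (assumed)
s2 = 2.0                   # variance of a treatment-control difference, treated as known

# Novelty effect: a strong lift on the first 300 differences, no effect afterwards.
n_novelty, n_total = 300, 20000
effect = np.concatenate([np.full(n_novelty, 0.5), np.zeros(n_total - n_novelty)])
d = effect + rng.normal(0.0, np.sqrt(s2), n_total)

# Running normal-mixture e-process for H0: mean difference = 0.
S = np.cumsum(d)
k = np.arange(1, n_total + 1)
e_process = np.sqrt(s2 / (s2 + k * tau**2)) * np.exp(
    tau**2 * S**2 / (2.0 * s2 * (s2 + k * tau**2))
)
crossings = np.flatnonzero(e_process >= 1.0 / alpha)
if crossings.size:
    print(f"e-process first exceeds 1/alpha at observation {crossings[0] + 1}")
else:
    print("e-process never exceeds 1/alpha")

# Fixed-sample t-test on the full stream: the transient early lift is diluted
# by the much longer period without any effect.
_, p_value = stats.ttest_1samp(d, 0.0)
print(f"fixed-sample t-test on all {n_total} differences: p = {p_value:.3f}")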


Table 6: Decisions of the safe t-test and the mSPRT on the OCE datasets.
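
Tables 5 and 6 are, in effect, cross-tabulations of per-dataset decisions. A minimal sketch of how such a table could be assembled is shown below; the two boolean lists are arbitrary placeholders standing in for per-dataset decisions, not the actual OCE results reported in this section.

import pandas as pd

# Placeholder per-dataset decisions (True = H0 rejected). These values are
# arbitrary illustrations, not the results summarised in Tables 5 and 6.
safe_rejects  = [True, True, False, True, False, True, True, False]
msprt_rejects = [True, False, False, True, False, True, False, False]

table = pd.crosstab(
    pd.Series(safe_rejects, name="safe t-test rejects H0"),
    pd.Series(msprt_rejects, name="mSPRT rejects H0"),
)
print(table)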


Unsurprisingly, given the behaviour observed in Figure 8, the null hypotheses rejected by the mSPRT are also rejected by the safe t-test. However, the safe test rejects even more hypotheses than the mSPRT. This is likely because the safe test is more sensitive than the mSPRT and reacts more strongly to data that contradict the null hypothesis. In the next section, we continue analysing the performance of safe tests at a large-scale tech company, Vinted.


Author:

(1) Daniel Beasley


This paper is available on arxiv under ATTRIBUTION-NONCOMMERCIAL-SHAREALIKE 4.0 INTERNATIONAL license.

