The Conflict Between Speed and Statistical Rigor
In the fast-paced world of B2C experimentation, speed is currency. Product managers and executives constantly monitor dashboards, hoping to see a significant uplift so they can roll out a winning feature and move to the next hypothesis.
However, traditional A/B testing (Fixed Horizon) operates under a strict constraint: you must determine the sample size in advance and wait until it is fully collected. Checking intermediate results ("peeking") and stopping as soon as the p-value drops below 0.05 violates statistical validity.
If you peek at the data daily during a month-long experiment, your False Positive Rate (Type I Error) inflates from the standard 5% to upwards of 30%. This means nearly a third of your "successful" experiments might actually be neutral changes, driven purely by random noise.
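To see this inflation concretely, here is a minimal simulation sketch (plain NumPy/SciPy, not tied to any experimentation library) that runs A/A tests with no true effect, peeks once per day on the cumulative data, and counts how often a naive t-test declares significance at least once:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_experiments, days, users_per_day = 2000, 30, 200
false_positives = 0

for _ in range(n_experiments):
    a = rng.normal(0, 1, size=(days, users_per_day))
    b = rng.normal(0, 1, size=(days, users_per_day))
    # Peek once per day on the cumulative data; stop at the first "significant" p-value
    for day in range(1, days + 1):
        _, p = stats.ttest_ind(a[:day].ravel(), b[:day].ravel())
        if p < 0.05:
            false_positives += 1
            break

print(f"False positive rate with daily peeking: {false_positives / n_experiments:.1%}")
# Typically prints well above the nominal 5% (around 25-30% for 30 looks)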
This article explores Sequential Testing, specifically the mSPRT (Mixture Sequential Probability Ratio Test) method used by companies like Uber and Netflix. We will examine how this approach allows for continuous monitoring and early stopping while maintaining strict error control, and demonstrate how to implement it using the open-source Python library trustworthy-experiments-core.
From Fixed Horizons to Anytime-Valid Inference
To solve the peeking problem, we need a statistical framework that adjusts the significance threshold based on the amount of evidence gathered.
The Limits of Group Sequential
Historically, analysts used Group Sequential designs (like O'Brien-Fleming or Pocock boundaries). These methods require you to pre-define the number of interim checks ("we will check results exactly 4 times"). While effective, they lack flexibility. If a test shows a massive regression on day 2 between scheduled checks, the framework technically does not support stopping.
The mSPRT Approach
Modern experimentation platforms increasingly rely on Confidence Sequences derived from mSPRT. This method provides Anytime-Valid Inference, meaning you can check the results as often as you like (even after every single user) without inflating the error rate.
Instead of a fixed p-value, mSPRT uses the Likelihood Ratio. It compares how likely the data is under the Alternative Hypothesis (H1) versus the Null Hypothesis (H0). Since we do not know the exact effect size beforehand, we mix the likelihoods over a probability distribution of possible effects.
The resulting test statistic behaves as a martingale under the Null Hypothesis. In simple terms, this means the statistic does not systematically drift upward when there is no real effect. This property allows us to construct a confidence interval that typically starts wide and narrows over time. If this interval excludes zero at any point, we can stop the test early with mathematical confidence.
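For intuition, when observations are approximately normal with known variance sigma^2 and the effect is mixed over a normal distribution with variance tau^2, the mixture likelihood ratio has a closed form. The sketch below is a standalone illustration of that formula (not the library's internal code): it monitors a one-sample stream against a zero mean and stops once the statistic crosses 1/alpha, which is anytime-valid by Ville's inequality.

import numpy as np

def msprt_statistic(observations, sigma2, tau2):
    """Mixture likelihood ratio for H0: mean = 0, mixing the effect over N(0, tau2).

    Closed form for i.i.d. (approximately) normal observations with known variance sigma2:
    Lambda_n = sqrt(sigma2 / (sigma2 + n*tau2)) * exp(tau2 * S_n^2 / (2*sigma2*(sigma2 + n*tau2)))
    """
    x = np.asarray(observations, dtype=float)
    n = np.arange(1, len(x) + 1)
    s = np.cumsum(x)                                   # running sum S_n
    denom = sigma2 + n * tau2
    return np.sqrt(sigma2 / denom) * np.exp(tau2 * s**2 / (2 * sigma2 * denom))

alpha = 0.05
rng = np.random.default_rng(0)
stream = rng.normal(0.2, 1.0, size=5000)               # small true effect of +0.2
lam = msprt_statistic(stream, sigma2=1.0, tau2=0.04)

crossed = lam >= 1 / alpha
stop_at = int(np.argmax(crossed)) + 1 if crossed.any() else None
print("Stopped at observation:", stop_at)
# Under H0 the statistic is a martingale with expectation 1, so
# P(it ever exceeds 1/alpha) <= alpha no matter how often you check.

The two-sample A/B version applies the same idea to the difference in group means.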
Tuning Sensitivity: The Tau Parameter
A critical component of mSPRT is the parameter Tau. It controls the shape of the mixing distribution and essentially acts as a tuning knob for the test's sensitivity.
- Small Tau: The test becomes more sensitive to small effects but requires more data to converge;
- Large Tau: The test detects large effects very quickly but may miss subtle uplifts.
To choose the optimal Tau, you can use a heuristic based on your Minimum Detectable Effect (MDE) and the standard deviation of your metric (sigma):
tau ≈ MDE / sigma
This ensures the test is most efficient for the effect size you actually care about.
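For example, with illustrative numbers:

mde = 0.5     # smallest absolute uplift worth detecting (illustrative)
sigma = 1.0   # standard deviation of the metric (illustrative)
cs_tau = mde / sigma   # = 0.5, the value used as cs_tau in the configuration below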
Python Implementation
Implementing mSPRT from scratch requires careful handling of variance estimation and the mixture integral. My library encapsulates this logic, allowing for a streamlined workflow.
Below is an example of how to configure and run a sequential test using the library.
1. Configuration
We define the test parameters. Note that we select CONFIDENCE_SEQUENCE mode for mSPRT and set alpha to 0.05.
from tecore.sequential.schema import SequentialConfig, SequentialMode, EffectDirection
# Configure for mSPRT (Anytime-Valid)
config = SequentialConfig(
    mode=SequentialMode.CONFIDENCE_SEQUENCE,
    alpha=0.05,
    two_sided=True,
    effect_direction=EffectDirection.TWO_SIDED,
    cs_tau=0.5,  # Tuned based on MDE / sigma
)
2. Data Preparation and Execution
The library expects a look_table: a DataFrame where each row represents a cumulative snapshot of the experiment (e.g., daily aggregates).
from tecore.sequential.confidence_sequences import run_confidence_sequence
from tecore.sequential.preprocess import build_look_table_mean
from tecore.sequential.schema import SequentialSpec, LookSchedule
spec = SequentialSpec(
    group_col="group",
    y_col="revenue",
    control_label="A",
    test_label="B",
    timestamp_col="date",
)
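# For illustration only: a toy user-level DataFrame with the column names declared in the spec.
# In practice this would be your own export with one row per user (or observation).
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n_users = 20_000
experiment_data = pd.DataFrame({
    "group": rng.choice(["A", "B"], size=n_users),
    "revenue": rng.exponential(scale=10.0, size=n_users),
    "date": rng.choice(pd.date_range("2024-01-01", periods=30), size=n_users),
})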
look_table, warnings = build_look_table_mean(
    df=experiment_data,
    spec=spec,
    schedule=LookSchedule(n_looks=30),  # Daily checks for a month
    cfg=config,
)
# Run the sequential test
result = run_confidence_sequence(look_table, config)
if result.stopped:
    print(f"Test stopped early at look {result.stop_look}!")
    print(f"Decision: {result.decision}")
else:
    print("Continue testing.")
The result object contains the decision flag, the stopping point, and the full trajectory of the confidence sequence, which can be visualized to show stakeholders exactly when the boundary was crossed.
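As an illustration of such a chart, the sketch below assumes the trajectory is exposed as a DataFrame with per-look bounds; the attribute and column names here (trajectory, look, lower, upper) are placeholders, so adjust them to the actual schema of the result object.

import matplotlib.pyplot as plt

traj = result.trajectory                     # placeholder attribute name
plt.plot(traj["look"], traj["lower"], label="CS lower bound")
plt.plot(traj["look"], traj["upper"], label="CS upper bound")
plt.axhline(0, color="grey", linestyle="--", label="No effect")
if result.stopped:
    plt.axvline(result.stop_look, color="red", linestyle=":", label="Early stop")
plt.xlabel("Look (day)")
plt.ylabel("Effect on revenue per user")
plt.legend()
plt.show()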
Ratio Metrics and Heavy Tails
While the theory is elegant, applying it to real B2C data introduces complications. Standard mSPRT assumes normal distributions or simple sums, but key business metrics like Conversion Rate or ARPU (Average Revenue Per User) are often Ratio Metrics.
Linearization for Ratio Metrics
A ratio metric (e.g., Sum(Revenue) / Count(Users)) has a dependency between the numerator and denominator. Using standard formulas directly can lead to unstable variance estimates early in the test.
To address this, the library applies Linearization (a technique derived from the Delta Method). This transforms the ratio into a linearized, per-observation metric:
Linearized_Y = Numerator - theta * Denominator
Where theta is the baseline ratio estimated from the control group. This transformation decouples the variance, allowing us to apply standard mSPRT logic to complex ratios without losing statistical rigor.
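The transformation itself is simple enough to sketch outside the library; the column names below are illustrative, using revenue per session as the ratio metric:

import pandas as pd

def linearize_ratio(df: pd.DataFrame, num_col: str, den_col: str,
                    group_col: str = "group", control_label: str = "A") -> pd.Series:
    """Delta-method linearization of a ratio metric Sum(num_col) / Sum(den_col)."""
    control = df[df[group_col] == control_label]
    theta = control[num_col].sum() / control[den_col].sum()   # baseline ratio from control
    # Per-row linearized metric: Y_lin = numerator - theta * denominator
    return df[num_col] - theta * df[den_col]

# Usage: df["y_lin"] = linearize_ratio(df, num_col="revenue", den_col="sessions")
# The linearized column can then be analyzed as an ordinary per-user mean metric.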
Handling Heavy Tails
In e-commerce, data often follows heavy-tailed distributions (a few whales making massive purchases). A single outlier can spike the variance, causing the confidence interval to expand unexpectedly or, conversely, triggering a false positive.
To mitigate this, robust sequential testing requires Guardrails (a minimal sketch follows the list):
- Minimum Sample Size: Enforcing a burn-in period (1000 observations) before the stopping rule is activated;
- Winsorization: Capping extreme values at a high percentile (99.9%) to prevent single observations from dominating the variance calculation.
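Both guardrails are easy to apply to the raw data before it enters the sequential analysis; the sketch below uses generic NumPy code rather than any library-specific API, with the thresholds from the list above:

import numpy as np

MIN_SAMPLE_PER_GROUP = 1_000     # burn-in before the stopping rule may fire
WINSOR_PERCENTILE = 99.9         # cap for extreme values

def winsorize(values, upper_percentile=WINSOR_PERCENTILE):
    """Cap extreme observations so a single whale cannot dominate the variance."""
    cap = np.percentile(values, upper_percentile)
    return np.minimum(values, cap)

def stopping_allowed(n_control, n_test, min_n=MIN_SAMPLE_PER_GROUP):
    """Ignore boundary crossings until both groups have passed the burn-in threshold."""
    return n_control >= min_n and n_test >= min_n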
Conclusion
Sequential testing transforms A/B testing from a passive wait-and-see approach into an active monitoring tool. By adopting mSPRT, product teams can stop poorly performing tests immediately, limiting the blast radius of bad features. Conversely, highly successful features can be rolled out 30-50% faster than with fixed-horizon tests, freeing up traffic for new hypotheses.
However, the implementation details matter. Simply applying formulas from a textbook without accounting for ratio metrics or outliers can lead to incorrect decisions. The trustworthy-experiments-core library aims to bridge this gap, providing a production-ready toolkit for rigorous sequential analysis.
By shifting to valid sequential methods, we align statistical incentives with business goals: making faster decisions without sacrificing the truth.
