Building Request-A-Quote Yelp (Ex Yahoo). Marketplaces x Behavioral Sciences.
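P value is the probability of seeing results at least as extreme as the ones we observed if the change actually had no effect. A small P value means the data would be surprising under random chance alone; it is not the probability that the result is real. P-Hacking is a term used to describe manipulating the data or the analysis until it produces the desired P value. All of us do this with our experiments, consciously or not.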
Source: Atoz Markets
To prevent P-Hacking you need to understand 3 concepts: multiple comparisons, power analysis, and confidence intervals.
The more comparisons you make, the more likely you are to see a false positive. In other words, you can end up making business decisions based on a difference that is only noise. In the context of experiments, comparisons can be cohorts, metrics, or dimensions of a metric.
PMs (including me) get this wrong when we slap tens of metrics onto an experiment in the hope of finding some positive result, or under the guise of protecting other parts of the business.
You can avoid that by deciding on the comparisons before running the experiment and limiting them. As a general rule of thumb, I use at most 5 metrics and 2 or 3 different treatment cohorts per experiment.
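To make the effect concrete, here is a minimal sketch of how the chance of at least one false positive grows with the number of comparisons. It assumes the comparisons are independent and each is tested at a 5% significance level; the function name is only illustrative.

```python
def family_wise_error_rate(num_comparisons: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent comparisons,
    when none of them has a real effect (assumes independence)."""
    return 1 - (1 - alpha) ** num_comparisons


for m in (1, 5, 10, 20):
    rate = family_wise_error_rate(m)
    print(f"{m:>2} comparisons -> {rate:.0%} chance of at least one false positive")
```

Under these assumptions, 10 comparisons already give roughly a 40% chance of at least one spurious "win". Real metrics are usually correlated, so the exact number will differ, but the direction holds.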
If you absolutely need to use more metrics, you should “correct” for the problem.
One way to do so is the Bonferroni correction, which divides the significance threshold by the number of comparisons (in practice the correction is best left to Data Scientists).
However, adjusting the threshold this way reduces the power, so the experiment needs a greater sample size.
You should factor that in while conducting the power analysis.
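As a rough illustration of the Bonferroni correction, the sketch below divides the significance level by the number of metrics and checks each P value against the stricter threshold. The metric names and P values are made up for the example.

```python
def bonferroni(p_values: dict[str, float], alpha: float = 0.05) -> dict[str, bool]:
    """Return, per metric, whether it is still significant after Bonferroni."""
    threshold = alpha / len(p_values)  # stricter per-comparison threshold
    return {name: p <= threshold for name, p in p_values.items()}


# Hypothetical P values from one experiment with five metrics.
p_values = {
    "conversion": 0.004,
    "revenue_per_user": 0.03,
    "engagement": 0.04,
    "retention": 0.20,
    "latency": 0.60,
}

significant = bonferroni(p_values)
for name, is_significant in significant.items():
    print(f"{name}: p={p_values[name]} significant={is_significant}")
```

With five metrics the threshold drops from 0.05 to 0.01, so only the conversion result survives here, even though three metrics looked significant before the correction.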
Power is the probability of detecting an effect when one truly exists. Power analysis is used to estimate the minimum sample size (number of users per cohort) the experiment needs to reach the desired power.
PMs get this wrong when we stop the experiment before it has enough samples, to save time, because the experiment is trending positive.
This is one of the biggest reasons that experiments fail. As seen in the figure, T’ is the optimal stopping point based on the power analysis. Running the experiment until that point leads to the conclusion that the change has a negative impact on the metric. However, at T, the change incorrectly seems to have a positive impact.
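One way to see why stopping early is risky is to simulate A/A tests, where both cohorts are identical and any "significant" result is a false positive. The sketch below is an assumption-laden toy: the metric is standard normal noise, we peek 10 times at equal intervals, and we count how often any peek crosses the 5% threshold versus how often the final, planned analysis does.

```python
import random
from statistics import NormalDist


def simulate_peeking(num_experiments=1000, users_per_cohort=500,
                     num_peeks=10, alpha=0.05, seed=7):
    """Return (false positive rate with peeking, false positive rate at the end)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    step = users_per_cohort // num_peeks  # users added per cohort between peeks
    crossed_at_any_peek = 0
    crossed_at_end = 0
    for _experiment in range(num_experiments):
        sum_a = sum_b = 0.0
        crossed = False
        z = 0.0
        for peek in range(1, num_peeks + 1):
            for _user in range(step):
                sum_a += rng.gauss(0, 1)  # control: no real effect
                sum_b += rng.gauss(0, 1)  # treatment: identical distribution
            n = peek * step
            # z statistic for the difference in means with known variance 1
            z = (sum_b / n - sum_a / n) / (2 / n) ** 0.5
            if abs(z) > z_crit:
                crossed = True  # a PM peeking here would have "called it"
        crossed_at_any_peek += int(crossed)
        crossed_at_end += int(abs(z) > z_crit)
    return crossed_at_any_peek / num_experiments, crossed_at_end / num_experiments


peeking_rate, fixed_rate = simulate_peeking()
print(f"False positives when stopping at any significant peek: {peeking_rate:.1%}")
print(f"False positives when analysing only at the planned end: {fixed_rate:.1%}")
```

The planned analysis should stay close to the nominal 5%, while stopping at the first significant peek produces false positives several times as often.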
The power analysis depends on the decision metric, the minimum detectable effect (expected lift), and the desired power level.
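Here is a minimal sketch of such a calculation, assuming the decision metric is a conversion rate and using the standard normal approximation for comparing two proportions. The function name, the 10% baseline, and the 5% relative lift are illustrative assumptions; your data science team may use a different method.

```python
import math
from statistics import NormalDist


def sample_size_per_cohort(baseline_rate: float, relative_lift: float,
                           alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per cohort for a two-sided two-proportion test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)  # rate if the expected lift is real
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_power = NormalDist().inv_cdf(power)          # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


n_single = sample_size_per_cohort(baseline_rate=0.10, relative_lift=0.05)
# Same experiment with a Bonferroni-corrected threshold for 5 metrics.
n_corrected = sample_size_per_cohort(baseline_rate=0.10, relative_lift=0.05,
                                     alpha=0.05 / 5)
print(f"Users per cohort, single metric: {n_single}")
print(f"Users per cohort, 5 metrics with Bonferroni: {n_corrected}")
```

Two things are worth noticing: small lifts on a low baseline need tens of thousands of users per cohort, and correcting for multiple metrics pushes the requirement higher still.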
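While this section doesn’t directly prevent P-Hacking, it will help you understand and identify it better.
A confidence interval (CI) is the range of values the true metric change plausibly lies in. Usually we use the 95% CI. A 95% CI of -10 to +10 means that the procedure used to build the interval captures the true value in 95% of repeated experiments; loosely, the true change could be anywhere from -10 to +10, including zero.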
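For a feel of where the interval comes from, the sketch below computes a 95% CI for the difference between two conversion rates using the normal approximation. The counts are invented for the example and the function name is illustrative.

```python
import math
from statistics import NormalDist


def lift_confidence_interval(control_conversions: int, control_users: int,
                             treatment_conversions: int, treatment_users: int,
                             confidence: float = 0.95) -> tuple[float, float]:
    """CI for (treatment rate - control rate), normal approximation."""
    p_c = control_conversions / control_users
    p_t = treatment_conversions / treatment_users
    standard_error = math.sqrt(p_c * (1 - p_c) / control_users
                               + p_t * (1 - p_t) / treatment_users)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_t - p_c
    return diff - z * standard_error, diff + z * standard_error


# Hypothetical experiment: 10.0% vs 10.5% conversion, 10,000 users per cohort.
low, high = lift_confidence_interval(1000, 10_000, 1050, 10_000)
print(f"Observed lift: +0.50 points, 95% CI: {low:+.2%} to {high:+.2%}")
```

The observed lift is positive, but the interval runs from about -0.3 to +1.3 percentage points, so a small negative impact is still compatible with the data.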
PMs get this wrong when we only look at the exact metric value and not the entire interval.
In my experience, the majority of people make decisions by looking only at the point estimate in the middle of the CI, not the entire range. This creates a false sense of security. Practical advice for interpreting the CI:
Previously, I built the Yelp experimentation program from the ground up and launched new products (chatbots, AI assistant, and accessible interfaces) at Yahoo!