Causal Analysis – Experimentation (AB Testing) and Statistical Techniques

by Varun Nakra, May 6th, 2024

Too Long; Didn't Read

Causal analysis background and overview of different techniques to perform a causal analysis.


Objective – Getting to the cause of something is incredibly important to us because if we can pin down the factor that is causing our outcome of interest, we can possibly manipulate it to change the outcome. A layman's solution to this problem could be examining Y (outcome of interest) and X (a variable that could be the cause of Y) and observing how Y changes when X changes.


For simplicity, assume the change we want to examine is linear in nature. The mathematical framework for analyzing this is ‘correlation analysis.’ If Y increases or decreases in proportion to changes in X, we could say Y and X are ‘correlated’ with each other. However, if Y does change when X changes, could we say for sure that X is ‘causing’ the change in Y?
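
For illustration, here is a minimal sketch (Python, simulated data; none of the numbers come from the article) of measuring that linear association with the Pearson correlation coefficient:

```python
# A minimal sketch of correlation analysis on simulated data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000)             # X: the candidate cause
y = 2.0 * x + rng.normal(size=1_000)   # Y: moves linearly with X, plus noise

r = np.corrcoef(x, y)[0, 1]            # Pearson correlation coefficient
print(f"correlation(X, Y) = {r:.2f}")  # close to 1: strong positive association
```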


Individuals with higher levels of education tend to earn higher incomes; therefore, income and education levels are correlated with each other. Perhaps an increase in the level of education (X) is causing an increase in the level of income (Y)? But could it not be that something else is causing an increase in the level of income (Y) AND an increase in the level of education (X), but it is hidden from us?


That is, both Y and X appear to increase simultaneously, but a hidden variable is driving the increase in both at the same time. We know that individuals from affluent families may have both higher education levels and higher incomes due to inherited wealth and opportunities. That is, family status and wealth (Z) could be responsible for both: the increase in the level of education (X) and the increase in the level of income (Y).


Thus, the correlation (association) between Y and X fails to show us the complete picture by not giving us the cause (direct influence) of Y. In other words, X may not be the cause of Y; some other variable Z may be, but it is ‘hidden’ from us. Such a variable is known as a ‘confounding’ variable.


Confounding occurs when an extraneous variable influences both the independent (X) and dependent (Y) variables in a way that distorts the true relationship between them. Thus, confounding makes the job of identifying the causal variable difficult, and we need ways and means to get around it.
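
To make this concrete, here is a hedged simulation sketch (Python, invented numbers): a hidden Z drives both X and Y, so X and Y come out correlated even though X has no direct effect on Y at all:

```python
# Confounding in miniature: Z (family wealth) drives both X (education)
# and Y (income); X has no direct effect on Y in this simulation.
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(size=10_000)       # Z: hidden confounder
x = z + rng.normal(size=10_000)   # X: driven by Z
y = z + rng.normal(size=10_000)   # Y: driven by Z, not by X

print(np.corrcoef(x, y)[0, 1])    # ~0.5: a spurious X-Y correlation via Z
```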


There are two prominent methodologies for controlling for confounding factors – a) experimental design and b) statistical methods.


Experimental Design – In experimental settings, causal analysis often involves manipulating the independent variable(s) and observing the effects on the dependent variable(s) while controlling for other factors. Randomized controlled trials (RCTs) and A/B testing are the gold standard for establishing causal relationships, as random assignment helps minimize bias and confounding.


A/B testing, also known as split testing, is a statistical method used to compare two or more versions of a webpage, app, marketing campaign, or other digital asset to determine which one performs better in achieving a specific goal. The goal could be increasing click-through rates, improving conversion rates, enhancing user engagement, or maximizing revenue. While A/B testing and RCTs share many similarities, they also differ, particularly in their contexts of use and the level of control over experimental conditions.


A/B testing is commonly used in digital marketing, website optimization, and user experience design, where experiments can be conducted quickly and at scale. RCTs, on the other hand, are often used in clinical research, where rigorous control over variables and ethical considerations are paramount.
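
As a sketch of how an A/B test is typically evaluated (with hypothetical conversion counts; the two-proportion z-test is one common choice, not the only one):

```python
# Comparing conversion rates of control (A) and treatment (B) with a
# two-proportion z-test. Counts are made up for illustration.
from statsmodels.stats.proportion import proportions_ztest

conversions = [430, 495]    # conversions observed in A and B
visitors = [5_000, 5_000]   # users randomly assigned to each group

z_stat, p_value = proportions_ztest(conversions, visitors)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")  # small p -> reject the null
```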


A question arises: “Why do we need two groups – a control (A) group and a treatment (B) group? Why can’t we perform a before-after inference?” That is, we could expose the users to the treatment and compare the outcome “before treatment” and “after treatment” without needing two separate groups. The reason is that we cannot causally infer from such an analysis that the observed effect is attributable to the treatment alone.


This is because other factors are not being controlled for! A prior experience in time cannot serve as a control group, because that experience is influenced by other factors that change across different points in time. Thus, the control group of the A/B test needs to run in parallel with the treatment group; that is, at the same time.


This ensures that “all things are equal.” The only change between the control group and the treatment group is the “treatment” itself. Both groups are exposed to the same conditions at the same time. Thus, we need two separate groups.


A high-level overview of the key steps of the experimental design is as follows:

· We need to define the null hypothesis and the practical significance boundary, and characterize the outcome metric for the experiment. How do we decide whether the test is “practically significant,” and why is “practical significance” necessary? Let’s say the difference in the outcome metric does not turn out to be statistically significant and we fail to reject the null hypothesis. That does not mean there is no treatment effect for sure; perhaps the effect would emerge if we increased the sample size. Therefore, we need another threshold against which to test significance.


That is, we need to know what is practically important to us and test the significance of that. The practical significance boundary can be defined using the minimum detectable effect (MDE), which is tied to the power of the experiment. If the lower bound of the confidence interval (CI) is larger than the MDE, then the test has practical significance. For example, if the CI = [5%, 7.5%] and the MDE = 3%, then we can conclude practical significance, since 5% > 3%.
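
In code, the check is a simple comparison, shown here with the article’s own numbers:

```python
# Practical significance check using the CI and MDE from the example above.
ci_lower, ci_upper = 0.05, 0.075  # confidence interval for the lift: [5%, 7.5%]
mde = 0.03                        # minimum detectable effect: 3%

statistically_significant = ci_lower > 0  # the CI excludes zero
practically_significant = ci_lower > mde  # the whole CI clears the MDE
print(statistically_significant, practically_significant)  # True True
```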


· We need to define the randomization unit and the population we want to target. This is followed by calculating the size of the experiment and how long it should run.
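
A sizing sketch (assumed baseline and lift, Python/statsmodels): the per-group sample size needed to detect a lift from a 10% baseline conversion rate to 11%, at 5% significance and 80% power; dividing by daily traffic then gives the run time:

```python
# Per-group sample size for a two-proportion test (assumed numbers).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.11, 0.10)  # Cohen's h for a 10% -> 11% lift
n_per_group = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, alternative="two-sided"
)
print(round(n_per_group))  # users needed in each of control and treatment
```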


· We need to perform sanity checks around the experiment. To catch issues, we look at guardrail metrics, or invariants. These metrics should not change between the control and treatment groups; if they do change, any measured differences are likely the result of other changes we made rather than the feature being tested.
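
One common sanity check is a test for sample ratio mismatch (SRM); a sketch with hypothetical counts, assuming a planned 50/50 split:

```python
# Chi-square test for sample ratio mismatch between the two groups.
from scipy.stats import chisquare

observed = [50_910, 49_090]    # users actually assigned to A and B
stat, p = chisquare(observed)  # expected frequencies default to an even split
print(f"p = {p:.4f}")          # a tiny p-value flags a broken assignment
```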


· The reliability of the experiment is also important; that is, the experimental results should be reproducible if required.


Statistical methods – Causal analysis employs various statistical methods and techniques to assess causality and estimate causal effects. These methods comprise the following:


· Regression controlling for confounders – As the name suggests, this technique involves fitting a regression model on Y after controlling for confounders. In mathematical terms, we model Y as a function of both X and Z, where X represents the variable whose causal effect on Y we want to analyze, and Z is the confounder. The drawback of this method is that we normally don’t have the list of, and/or the data about, all the confounders.
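
A minimal sketch of this on simulated data (variable names and coefficients are invented for illustration; the true effect of X is 0.5):

```python
# Regressing Y on X with and without the confounder Z (simulated data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 5_000
z = rng.normal(size=n)                      # confounder
x = z + rng.normal(size=n)                  # exposure, partly driven by Z
y = 0.5 * x + 2.0 * z + rng.normal(size=n)  # outcome; true effect of X is 0.5

df = pd.DataFrame({"y": y, "x": x, "z": z})
naive = smf.ols("y ~ x", data=df).fit()         # omits Z: biased upward
adjusted = smf.ols("y ~ x + z", data=df).fit()  # controls for Z
print(naive.params["x"], adjusted.params["x"])  # ~1.5 vs ~0.5
```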


· Regression Discontinuity Design (RDD) – Suppose a university implements a scholarship program for incoming freshmen, where students who score above a certain threshold on a standardized test are eligible for a full scholarship covering tuition and fees, while those who score below the threshold receive no scholarship. The cut-off threshold for eligibility is set at a test score of 80 out of 100. Our objective of causal analysis would be to determine the causal effect of receiving the scholarship on students' academic performance in their first year of college.


Similar to what we did in “regression controlling for confounders,” we regress the outcome variable (e.g., GPA) against “control variables,” which are confounders such as high school GPA and socioeconomic status, but we also include a dummy variable indicating the treatment status (above or below the threshold score of 80 out of 100).


If the coefficient of the dummy variable turns out to be statistically significant, then it would suggest that the students who receive the scholarship have higher GPAs in their first year of college compared to students who do not receive the scholarship.


The coefficient of the dummy variable is known as the “discontinuity coefficient,” and it represents the difference in academic performance between scholarship recipients and non-recipients at the cut-off threshold.
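
A sketch of that regression on simulated data (the true jump at the cut-off is set to 0.3 GPA points; real RDD work would also restrict to a bandwidth around the cut-off and allow different slopes on each side):

```python
# RDD: first-year GPA on a scholarship dummy plus the centered test score.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
score = rng.uniform(60, 100, size=2_000)  # standardized test score
treated = (score >= 80).astype(int)       # scholarship dummy at the cut-off
gpa = 2.0 + 0.02 * (score - 80) + 0.3 * treated + rng.normal(0, 0.2, 2_000)

df = pd.DataFrame({"gpa": gpa, "treated": treated, "centered": score - 80})
rdd = smf.ols("gpa ~ treated + centered", data=df).fit()
print(rdd.params["treated"])  # the discontinuity coefficient, ~0.3
```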


· ITS (Interrupted Time Series) – This refers to the “before-after inference” mentioned above in the section on “experimental design.” That is, the outcome variable is observed over time before and after the point of intervention, and a statistical analysis is performed on the time series. Similar to RDD, a dummy variable on time is included in the regression alongside confounding variables. The quantitative analysis can also employ time series models such as ARIMA.
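
A simple segmented-regression sketch of ITS (simulated monthly data, with the intervention’s true level shift set to 2; an ARIMA model would additionally absorb autocorrelation):

```python
# ITS as segmented regression: a level shift at the intervention point,
# controlling for the pre-existing time trend.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
t = np.arange(48)             # 48 months of observations
post = (t >= 24).astype(int)  # intervention happens at month 24
y = 10 + 0.1 * t + 2.0 * post + rng.normal(0, 0.5, size=48)

df = pd.DataFrame({"y": y, "t": t, "post": post})
its = smf.ols("y ~ t + post", data=df).fit()
print(its.params["post"])     # estimated level change, ~2
```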


· Difference in Differences – Similar to ITS, there is a concept of “before-after inference,” but here we have two separate groups, treatment and control. Despite that, it is not an experimental design but a statistical technique, as regression analysis is performed. In the “before” phase (before applying the treatment to the treatment group), the difference in the outcome between the treatment and control groups is calculated.


Then, in the “after” phase (after applying the treatment to the treatment group), the difference in the outcome between the treatment and control groups is calculated again. Finally, the difference in differences (DID) is calculated by subtracting the pre-intervention difference from the post-intervention difference. This “DID estimate” represents the causal effect of the treatment on the outcome.
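
In regression form, the DID estimate is the coefficient on the interaction of the group and period dummies; a sketch on simulated data with the true effect set to 1.0:

```python
# Difference in differences via OLS with an interaction term.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 4_000
treated = rng.integers(0, 2, size=n)  # group dummy: control vs treatment
post = rng.integers(0, 2, size=n)     # period dummy: before vs after
y = (1.0 * treated + 0.5 * post       # group and period effects
     + 1.0 * treated * post           # the causal effect of interest
     + rng.normal(0, 1, size=n))

df = pd.DataFrame({"y": y, "treated": treated, "post": post})
did = smf.ols("y ~ treated + post + treated:post", data=df).fit()
print(did.params["treated:post"])     # the DID estimate, ~1.0
```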


· Instrumental variables – An instrumental variable is a variable that changes Y only through X. That is, it is correlated with X but not with any other (confounding) variables that may be omitted from the model. As a result, when the instrumental variable changes, X changes, and then Y changes; thus, the instrumental variable isolates the change in Y due to the change in X. Going back to our discussion of education and earnings in the “objective” section of this article, we want to estimate the causal effect of education on earnings.


The challenge in this context is that education is often “endogenous” – individuals with higher innate abilities or motivation may be more likely to pursue higher levels of education, and these same characteristics may also lead to higher earnings. Those characteristics can affect earnings (Y) directly without affecting education (X). Therefore, they are not instrumental variables.


However, if we bring in another variable such as “mandatory schooling,” it would affect the level of education and, through that, the earnings. Thus, it would be an instrumental variable that affects earnings (Y) through the level of education (X) but does not affect Y directly.
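
A two-stage least squares (2SLS) sketch of this story on simulated data (‘mandates’ stands in for mandatory-schooling laws and ‘ability’ for the unobserved confounder; both names are invented). Note that running the two stages by hand gives the right point estimate but not valid standard errors; a dedicated IV routine handles those:

```python
# 2SLS by hand: the instrument 'mandates' shifts education but does not
# affect earnings directly. True effect of education on earnings is 0.5.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 20_000
ability = rng.normal(size=n)           # hidden confounder
mandates = rng.integers(0, 2, size=n)  # instrument
education = 1.0 * mandates + ability + rng.normal(size=n)
earnings = 0.5 * education + ability + rng.normal(size=n)

df = pd.DataFrame({"earnings": earnings, "education": education,
                   "mandates": mandates})
# Stage 1: predict education from the instrument alone.
df["edu_hat"] = smf.ols("education ~ mandates", data=df).fit().fittedvalues
# Stage 2: regress earnings on the predicted education.
tsls = smf.ols("earnings ~ edu_hat", data=df).fit()
naive = smf.ols("earnings ~ education", data=df).fit()
print(naive.params["education"],  # ~0.94: biased upward by ability
      tsls.params["edu_hat"])     # ~0.5: the true causal effect
```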


We will explore the aforementioned techniques, especially “Difference in Differences” and “Instrumental Variables,” in more mathematical detail in the next articles.