Thanks to breakthroughs in machine learning and deep learning, we now have an arsenal of algorithms that can handle almost any problem we throw at them. But the majority of these sophisticated and complex algorithms share a drawback: they are difficult to interpret.
When it comes to interpretability, nothing compares to the simplicity of linear regression. However, even a linear regression model can become hard to interpret, particularly if its assumption of no multicollinearity is violated.
In this article, we will look at the concept of multicollinearity and at its detection and treatment through a vital statistical metric - the “Variance Inflation Factor” (VIF).
The variance inflation factor is a statistical metric used to detect multicollinearity in supervised predictive models.
First of all, let us understand the concept of multicollinearity before jumping into the intuition behind VIF. Consider that we need to build a regression model to predict the salary of a person. We have the following independent variables in our dataset-
Our data looks like this-
So, this is a traditional multiple linear regression problem, and any multiple linear regression model can be represented using the mathematical formula-
Y = m1x1 + m2x2 + m3x3 + ... + mnxn + c
Where Y is the target variable, x1 to xn are the independent variables, m1 to mn are their coefficients (slopes), and c is the intercept.
According to our use case, we define our problem as-
Salary = (0.5 * Age) + (0.6 * Years of experience) + (0.1 * Job) + (0.2 * Gender) + (0.6 * City) + (0.7 * Cost of living index)
(Just assume some random values for the coefficients or slopes to understand the concept in a simple way).
The above formula intrinsically means that each coefficient tells us how much the salary changes for a unit change in the corresponding variable, keeping the others constant - for example, every additional year of experience adds 0.6 units to the predicted salary.
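As a small illustration, here is a minimal sketch in Python that applies the made-up coefficients above to one hypothetical person; all column names and values are assumptions chosen purely for illustration.

```python
# A minimal sketch applying the made-up coefficients above to one hypothetical person.
# Categorical variables such as Job, Gender, and City are assumed to be numerically
# encoded already; every value here is invented purely for illustration.

coefficients = {
    "Age": 0.5,
    "Years of experience": 0.6,
    "Job": 0.1,
    "Gender": 0.2,
    "City": 0.6,
    "Cost of living index": 0.7,
}

def predict_salary(row, intercept=0.0):
    """Apply the linear formula: sum of (coefficient * feature value) plus the intercept c."""
    return sum(coefficients[name] * value for name, value in row.items()) + intercept

sample = {
    "Age": 30,
    "Years of experience": 7,
    "Job": 2,
    "Gender": 1,
    "City": 3,
    "Cost of living index": 80,
}
print(predict_salary(sample))  # 0.5*30 + 0.6*7 + 0.1*2 + 0.2*1 + 0.6*3 + 0.7*80 = 77.4
```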
Now, we can logically think about a few things according to the real-life scenario - as a person's age increases, their years of experience usually increase as well, and the cost of living index of a place largely depends on the city. In other words, some of these independent variables are related to each other.
The primary objective of any regression model is to learn the variance in the dataset - that is, to find how each independent variable affects the target variable (by identifying a proper coefficient value) and to predict the target for unseen data points. Each independent variable we use should provide the model with some information about the target variable that the other variables cannot capture. If two variables explain the same variance (affect the target variable in approximately the same way), there is no point in keeping both of them: the second one is an additional burden on the model that increases time and space complexity without adding any value.
For example,
Consider that you have 4 friends - John, James, Joseph, and Jack. All of them went to watch a movie, but unfortunately, you were not able to get a ticket. So, your friends said they would watch the movie and explain the story to you scene by scene. The movie’s duration was 3 hours.
Assume that John explained the story to you from the 1st minute to the 60th minute, James from the 61st minute to the 120th minute, and Joseph from the 121st minute to the 180th minute.
Now, it's Jack’s turn. He doesn't know that the other 3 people already covered all parts of the story. He again explained to you the story from the 121st minute to the 180th minute.
Now, you have heard the last 60 minutes of the story twice. Either Joseph or Jack alone was enough to explain the last 60 minutes. Since they both explained the same thing, there is no additional information they can provide to you. Technically, you listened to the story for 240 minutes, although the entire information could have been covered in 180 minutes. The time and energy spent by Jack went in vain, and you also lost time hearing redundant information.
So, Jack and Joseph are multicollinear !!!
Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related.
An obvious question arises here: what exactly is the issue if 2 variables are highly collinear?
We discussed the coefficients in the multiple linear regression formula,
According to our problem statement, it is -
Salary = (0.5 * Age) + (0.6 * Years of experience) + (0.1 * Job) + (0.2 * Gender) + (0.6 * City) + (0.7 * Cost of living index)
Here, we know that Age and Years of experience are highly correlated.
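One quick way to see such a relationship is a pairwise correlation matrix. A minimal sketch, using a hypothetical pandas DataFrame with invented values for a few of the variables:

```python
import pandas as pd

# Hypothetical values for a few of the independent variables.
df = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45],
    "Years of experience": [2, 7, 12, 17, 22],
    "Cost of living index": [70, 80, 75, 90, 85],
})

# Pairwise Pearson correlations; values close to +1 or -1 indicate strong linear relationships.
print(df.corr())
```

A correlation matrix, however, only captures pairwise relationships; multicollinearity can also arise when a variable is explained by a combination of several other variables, which is exactly what VIF measures.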
The next question is -
How can we identify whether multicollinearity is present in a dataset or not ?
The answer is “VIF”.
Now, let's understand the intuition behind VIF.
We discussed earlier that a specific predictor variable should not be explainable through relationships among the other predictor variables in the data.
Before moving to the example, let's understand the meaning of R Squared.
R Squared, also known as the coefficient of determination, is one of the most widely used evaluation metrics for linear regression models. It is a goodness-of-fit metric that typically ranges from 0 to 1. The higher the R Squared value, the better the model explains the variance in the target variable and the stronger its predictive ability.
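As a quick illustration of R Squared itself, here is a minimal sketch using scikit-learn's r2_score on made-up actual and predicted values:

```python
from sklearn.metrics import r2_score

# Hypothetical actual vs. predicted values from some regression model.
y_actual = [10, 12, 14, 16, 18]
y_predicted = [11, 12, 13, 17, 18]

# R Squared: the share of variance in y_actual that the predictions explain.
print(r2_score(y_actual, y_predicted))
```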
The VIF is represented with a mathematical formula-
VIF = 1 / (1 - R Squared)
In order to identify the multicollinearity, we need to derive the VIF of all the predictor variables present in the dataset.
Let’s say we need to calculate the VIF for Age. Then we should consider “Age” as the target variable, treat all the other predictor variables as independent variables, and fit a multiple linear regression model.
For example,
For finding the VIF of Age,
Let us build a model like this-
Age = (0.6 * Years of experience) + (0.1 * Job) + (0.2 * Gender) + (0.6 * City) + (0.7 * Cost of living index)
We will get an R Squared value for this model.
Let's assume that the R Squared value we got is 0.85.
Now,
VIF = 1 / (1 - 0.85)
VIF = 1 / 0.15
VIF = 6.67
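To make that recipe concrete, here is a minimal sketch of the same calculation in Python using scikit-learn; the DataFrame, its column names, and all values are hypothetical, and in practice you would use your own feature set.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data; the "Age" and "Years of experience" columns are deliberately
# almost collinear, and categorical columns are assumed to be numerically encoded.
df = pd.DataFrame({
    "Age": [25, 28, 30, 33, 35, 38, 40, 43, 45, 50],
    "Years of experience": [2, 5, 6, 9, 11, 13, 16, 18, 21, 26],
    "Job": [1, 1, 2, 2, 2, 3, 3, 3, 4, 4],
    "Gender": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "City": [1, 2, 1, 3, 2, 3, 1, 2, 3, 2],
    "Cost of living index": [70, 82, 71, 90, 80, 92, 72, 81, 91, 83],
})

def vif_for(column, data):
    """Regress `column` on all the other columns and return 1 / (1 - R Squared)."""
    X = data.drop(columns=[column])
    y = data[column]
    r_squared = LinearRegression().fit(X, y).score(X, y)
    # A perfect fit (R Squared of exactly 1) means perfect collinearity, i.e. infinite VIF.
    return float("inf") if r_squared == 1 else 1.0 / (1.0 - r_squared)

print(vif_for("Age", df))
```

Because the hypothetical Age and Years of experience columns were made nearly collinear, the R Squared of this auxiliary regression should come out close to 1, and the resulting VIF should be very large.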
One thing to note here: if R Squared is a large number (near 1), then the denominator of the VIF formula becomes a small number (because the denominator is 1 - R Squared).
If the denominator is small, then the VIF becomes a large number (because we are dividing 1 by that small denominator, 1 - R Squared).
So, the higher the R Squared, the higher the VIF.
For example,
We got a high VIF value of 6.67 earlier because we had a high R Squared value (0.85).
If we had a low R Squared value like 0.2, then our VIF would also have been low-
VIF = 1 / (1 - 0.2)
VIF = 1 / 0.8
VIF = 1.25
Indirectly, this conveys that a high VIF means the variable is largely explained by the other predictors (high multicollinearity), while a low VIF means it carries information the other predictors do not.
A rule of thumb for interpreting the variance inflation factor: a VIF of 1 indicates no correlation with the other predictors, a VIF between 1 and 5 indicates moderate correlation, and a VIF above 5 indicates high correlation.
Consider that we got the VIF for each variable as mentioned in the table-
It is very clear that “Age” and “Years of experience” are highly inflated and correlated, since their VIF values are above 5.
Hence, we can remove one of these variables from the model (preferably “Age”, as it has the highest VIF for now). After removing “Age”, the number of independent variables comes down from 6 to 5.
We can repeat the above-mentioned process one more time-
Now, we can see that the VIF of Years of experience has come down. However, City and Cost of living index look inflated, so we can drop one of those variables from the model. In the next iteration of modelling, we got the VIF values shown below-
Now, all of the variables look non-correlated and independent. Hence, we can proceed with the actual modeling considering “Salary” as the target variable.
We started with 6 independent variables for the regression model. With the help of VIF, we were able to identify multicollinearity in the data and found 2 variables to drop from the model. As a result, our regression model will be more generalized, more accurate, and less complex.
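Putting the whole procedure together in code, here is a minimal sketch of the iterative elimination, reusing the hypothetical vif_for helper and DataFrame df sketched earlier and a threshold of 5:

```python
def drop_high_vif(data, threshold=5.0):
    """Repeatedly drop the column with the highest VIF until every VIF is below the threshold."""
    data = data.copy()
    while data.shape[1] > 1:
        # Compute the VIF of every remaining column.
        vifs = {col: vif_for(col, data) for col in data.columns}
        worst_col, worst_vif = max(vifs.items(), key=lambda item: item[1])
        if worst_vif <= threshold:
            break
        print(f"Dropping {worst_col} (VIF = {worst_vif:.2f})")
        data = data.drop(columns=[worst_col])
    return data

reduced_df = drop_high_vif(df)
print(reduced_df.columns.tolist())
```

At each step only the single worst variable is removed and the VIFs are recomputed, mirroring the two iterations described above.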
As a learning from the whole article, we can summarize the entire content via the following points-
Multicollinearity occurs when two or more independent variables are highly linearly related, and it hurts the interpretability of a regression model.
The VIF of a predictor is obtained by regressing it on all the other predictors and computing 1 / (1 - R Squared).
A VIF above 5 signals high multicollinearity.
Such variables can be dropped one at a time, recomputing the VIFs after each drop.
In this article, we discussed one of the most important foundational concepts in applied statistics. The VIF metric is available through built-in libraries in most data science-oriented programming languages, such as Python or R, so it is easy to implement once you understand the theoretical intuition (a minimal Python sketch is shown below). I have added links to some advanced materials in the references section, where you can dive deeper into the detailed calculations if you are interested.
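For example, in Python the statsmodels package exposes this metric directly through variance_inflation_factor. A minimal sketch, assuming your independent variables sit in a pandas DataFrame (the data below is hypothetical):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical independent variables; replace with your own feature set.
df = pd.DataFrame({
    "Age": [25, 28, 30, 33, 35, 38, 40, 43, 45, 50],
    "Years of experience": [2, 5, 6, 9, 11, 13, 16, 18, 21, 26],
    "Cost of living index": [70, 82, 71, 90, 80, 92, 72, 81, 91, 83],
})

# statsmodels expects an explicit constant (intercept) column.
X = sm.add_constant(df)

# One VIF per column; the value reported for the constant can be ignored.
vif_table = pd.DataFrame({
    "feature": X.columns,
    "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
print(vif_table)
```

Under the hood, this computes each VIF using the same auxiliary-regression recipe we walked through above.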