Variance Inflation Factor - A Pertinent Statistical Metric for the Discernment of Multicollinearity

Written by sanjaykn170396 | Published 2023/02/03
Tech Story Tags: artificial-intelligence | machine-learning | statistics | datascience | linear-regression | multicollinearity | regression | predictive-analytics

TL;DR: The variance inflation factor (VIF) is a statistical metric used to detect multicollinearity in supervised predictive models. For each independent variable, VIF measures how well that variable can be explained by the remaining predictors; a high VIF means the variable adds little information that the other predictors do not already capture, making it a candidate for removal.

Introduction

We now have an arsenal of algorithms that can handle any problem we throw at them, thanks to breakthroughs in machine learning and deep learning. But the majority of these sophisticated and complex algorithms share a problem: they are hard to interpret.

When it comes to interpretability, nothing compares to the simplicity of linear regression. However, even a linear regression model can become hard to interpret, particularly if its assumptions, such as the absence of multicollinearity, are violated.

In this article, we will look at the concept of multicollinearity and at its detection and treatment through a vital statistical metric, the Variance Inflation Factor (VIF).

The concept behind Multicollinearity

The variance inflation factor is a statistical metric used to detect multicollinearity in supervised predictive models.

First of all, let us understand the concept of multicollinearity before jumping into the intuition of VIF. Consider that we need to build a regression model to predict the salary of a person. We have the following independent variables in our dataset-

  1. Age
  2. Years of experience
  3. Job
  4. Gender
  5. City
  6. Cost of living index in the place they are residing

Our data contains one row per person, with these six columns as predictors and Salary as the target.

So, this is a traditional multiple linear regression problem and any multiple linear regression can be represented using the mathematical formula-

Y = m1x1 + m2x2 + m3x3 + … + mnxn + c

Where,

  • Y is the target variable
  • x1, x2, x3…xn are all independent variables
  • m1, m2, m3…mn are the slopes (coefficients) of x1, x2, x3…xn respectively
  • c is the intercept

According to our use case, we define our problem as-

Salary = (0.5 * Age) + (0.6 * Years of experience) + (0.1 * Job) + (0.2 * Gender) + (0.6 * City) + (0.7 * Cost of living index)

(Just assume some random values for the coefficients or slopes to understand the concept in a simple way).

The above formula essentially means the following (a small numeric sketch of this appears after the list)-

  1. Whenever Age is increased by one unit, Salary will increase by 0.5 units keeping all other variables constant.
  2. Whenever the Cost of living index is increased by one unit, the Salary will increase by 0.7 units keeping all other variables constant.
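To see the "keeping all other variables constant" interpretation numerically, here is a tiny sketch using the assumed coefficients above; the numeric encodings of Job, Gender, and City are made up purely for illustration.

```python
# Hypothetical salary model using the assumed coefficients from above.
def predicted_salary(age, experience, job, gender, city, cost_of_living):
    return (0.5 * age + 0.6 * experience + 0.1 * job
            + 0.2 * gender + 0.6 * city + 0.7 * cost_of_living)

# Increase Age by one unit while holding every other variable constant.
base = predicted_salary(30, 5, 1, 0, 2, 80)
plus_one = predicted_salary(31, 5, 1, 0, 2, 80)
print(plus_one - base)   # 0.5, exactly the coefficient of Age
```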

Now, we can logically think about a few things according to the real-life scenario-

  1. All of the above-mentioned independent variables seem to have some relationship with the target variable (Salary).
  2. However, we know that Age and Years of experience have a straightforward linear relationship with each other. Assuming that a person starts to work around the age of 25, their years of experience will increase by one whenever their age increases by one, as the short sketch below illustrates.
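Here is a minimal sketch of that check on a small synthetic sample (the numbers are assumptions for illustration), where experience roughly tracks age minus 25:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic sample: experience is roughly age minus 25, plus a little noise.
age = rng.integers(25, 60, size=200).astype(float)
experience = age - 25 + rng.normal(0, 1, size=200)

# Pearson correlation between the two predictors.
corr = np.corrcoef(age, experience)[0, 1]
print(round(corr, 3))   # close to 1.0, i.e. the two variables are nearly collinear
```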

The primary objective of any regression model is to learn the variance in the dataset: to find how each independent variable affects the target variable (by identifying a proper coefficient value) and to predict the target for unseen data points. Each independent variable we include should provide the model with some information about the target variable that the other variables cannot capture. If two variables explain the same variance (that is, they affect the target variable in approximately the same way), then there is no point in keeping both of them; the redundant variable is an additional burden on the model that increases its time and space complexity without adding any value.

For example,

Consider that you have 4 friends - John, James, Joseph, and Jack. All of them went to watch a movie. Unfortunately, you were not able to get the tickets. So, your friends said that they will watch the movie and explain the story scene by scene. The movie’s duration was 3 hours.

Assume that-

  1. John explained the story of the first 60 minutes
  2. James explained the story from the 61st minute to the 120th minute
  3. Joseph  explained the story from the 121st  minute to the 180th minute

Now, it's Jack’s turn. He doesn't know that the other 3 people already covered all parts of the story. He again explained to you the story from the 121st  minute to the 180th minute.

Now, you have heard the last 60 minutes of the story twice. Either Joseph or Jack alone was enough to explain the last 60 minutes. Since they both explained the same thing, there is no additional information that they can provide to you. Technically, you heard the story for 240 minutes, although the entire story could have been covered in 180 minutes. The time and energy spent by Jack went in vain, and you also lost time hearing redundant information.

So, Jack and Joseph are multicollinear !!!

Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related.

An obvious question follows: what exactly is the issue if two variables are highly collinear?

We discussed the coefficients in the multiple linear regression formula,

According to our problem statement, it is -

Salary = (0.5 * Age) + (0.6 * Years of experience) + (0.1 * Job) + (0.2 * Gender) + (0.6 * City) + (0.7 * Cost of living index)

Here, we know that Age and Years of experience are highly correlated.

  • If we remove one of these collinear variables from the model, we can see that the model adapts itself to a new set of coefficients for all of the remaining independent variables, as the sketch after this list demonstrates.
  • Sometimes, if the collinearity is only approximate, it might not lead to a drastic decrease in the predictive ability of the model, but it will certainly increase the time and space complexity of the model.
  • However, if multiple variables are perfectly correlated, the model may break down entirely, since the coefficients can no longer be estimated uniquely, and performance suffers.
  • To maintain the assumptions of linear models, we should never take the risk of incorporating multiple collinear variables. A linear model expects all of the predictor variables to be independent of each other, i.e., a specific predictor variable should not be explainable through relationships among the other predictor variables in the data. In such a scenario, the coefficient estimates are reliable, leading to a generalized hyperplane that can maintain consistent predictions on any unseen set of data points.
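The first point above is easy to demonstrate. The sketch below uses a synthetic dataset (an assumption for illustration) in which Age and Years of experience are nearly collinear, fits the model first with both variables and then with one of them removed, and compares the coefficients:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 500

# Synthetic data: experience is nearly collinear with age.
age = rng.uniform(25, 60, n)
experience = age - 25 + rng.normal(0, 0.5, n)
cost_of_living = rng.uniform(50, 150, n)
salary = 0.5 * age + 0.6 * experience + 0.7 * cost_of_living + rng.normal(0, 1, n)

# Fit with both collinear predictors, then with "age" dropped.
X_full = np.column_stack([age, experience, cost_of_living])
X_reduced = np.column_stack([experience, cost_of_living])

print(LinearRegression().fit(X_full, salary).coef_)
print(LinearRegression().fit(X_reduced, salary).coef_)
# Dropping Age shifts the remaining coefficients: its effect is absorbed
# by the variable it was collinear with.
```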

The basic intuition behind VIF


The next question is -

How can we identify whether multicollinearity is present in a dataset or not?

The answer is “VIF”.

Now, let's understand the intuition behind VIF.

We discussed earlier that a specific predictor variable should not be explainable through relationships among the other predictor variables in the data.

Before moving to the example, let's understand the meaning of R Squared.

R Squared, aka the coefficient of determination, is one of the most widely used evaluation metrics for linear regression models. It is a goodness-of-fit metric that typically ranges from 0 to 1. A higher R Squared value indicates that the model explains more of the variance in the target and fits the data better.
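As a quick refresher on how R Squared is computed, here is a minimal sketch; the observed values and predictions are made up purely for illustration:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical observed values and model predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.3, 8.9])

# R Squared = 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
print(1 - ss_res / ss_tot)        # manual computation
print(r2_score(y_true, y_pred))   # the same value via scikit-learn
```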

The VIF is represented with a mathematical formula-

VIF = 1 / (1 - R Squared)

In order to identify the multicollinearity, we need to derive the VIF of all the predictor variables present in the dataset.

Let’s say we need to calculate the VIF for Age. Then we should treat “Age” as the target variable, take all other predictor variables as independent variables, and fit a multiple linear regression model.

For example,

For finding the VIF of Age,

Let us build a model like this-

Age = (0.6 * Years of experience) + (0.1 * Job) + (0.2 * Gender) + (0.6 * City) + (0.7 * Cost of living index)

We will be getting an R Squared value for this model.


Let's assume that the R Squared value we got is 0.85

Now,

VIF = 1 /(1- 0.85)

VIF = 1/ 0.15

VIF = 6.67
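In code, the per-variable procedure looks like the sketch below. It assumes the predictors are already numeric columns of a pandas DataFrame named df (a hypothetical name) and uses scikit-learn only to obtain the R Squared of the auxiliary regression:

```python
from sklearn.linear_model import LinearRegression

# df: a pandas DataFrame of numeric predictor columns (assumed to exist).
others = ["Years of experience", "Job", "Gender", "City", "Cost of living index"]
X = df[others]          # all predictors except "Age"
y = df["Age"]           # the variable whose VIF we want

# R Squared of the auxiliary regression "Age ~ all other predictors".
r_squared = LinearRegression().fit(X, y).score(X, y)

vif_age = 1 / (1 - r_squared)
print(vif_age)          # e.g. an R Squared of 0.85 gives a VIF of roughly 6.67
```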

One thing we should note here is that if R Squared is a large number (near 1), then the denominator of the VIF formula becomes a small number (because the denominator is 1 - R Squared).

If the denominator is a small number, then the value of VIF will be a large number (because we are dividing 1 by that denominator, i.e., by 1 - R Squared).

So, VIF grows as R Squared grows: the relationship is monotonic rather than strictly proportional, and VIF blows up as R Squared approaches 1.

For example,

We got a high VIF value of 6.67 earlier because we had a high R Squared value (0.85).

If we had a low R Squared value, like 0.2, then our VIF would have been correspondingly low-

VIF = 1 /(1- 0.2)

VIF = 1/ 0.8

VIF = 1.25

Indirectly, this conveys that-

  • If VIF is a low number, it means the other predictor variables together cannot explain much of the variance of the variable under examination. In other words, that predictor is statistically independent of (or isolated from) the other predictors.
  • If VIF is a high number, it means the other predictor variables together can explain a large share of the variance of the variable under examination. In other words, that predictor is not independent of the other predictors.
  • We need to repeat the same procedure as used for “Age” to find the VIF of every other independent variable (a sketch using statsmodels follows the rule of thumb below). Usually, a VIF value greater than 5 is not considered good.

A rule of thumb for interpreting the variance inflation factor:

  • 1 = not correlated.
  • Between 1 and 5 = moderately correlated.
  • Greater than 5 = highly correlated.
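In practice, we rarely run these auxiliary regressions one by one. statsmodels ships a helper, variance_inflation_factor, that applies the same formula to every column. Here is a minimal sketch, again assuming the predictors are available as a numeric pandas DataFrame named df (a hypothetical name):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# df: a pandas DataFrame of numeric predictor columns (assumed to exist).
# Add a constant column so each auxiliary regression has an intercept.
X = sm.add_constant(df)

vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vifs.drop("const"))   # one VIF per predictor; values above 5 are suspect
```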

Consider that, using this procedure, we computed the VIF for each variable. Suppose it turns out that “Age” and “Years of experience” are highly inflated and correlated, with VIF values above 5.

Hence, we can remove one of these variables from the model (preferably “Age”, as it has the highest VIF for now). After removing “Age”, the number of independent variables comes down from 6 to 5.

We can repeat the above-mentioned process one more time-

Now, we can see that the VIF of Years of experience has come down. However, suppose City and Cost of living index still appear inflated. In that case, we can drop one of those two variables from the model and run one more iteration of the VIF calculation.

Now, all of the variables look non-correlated and independent. Hence, we can proceed with the actual modeling considering “Salary” as the target variable.

We started with 6 independent variables for the regression model. With the help of VIF, we identified multicollinearity in the data and found 2 variables to drop from the model. Now, our regression model will be more generalized, more accurate, and less complex.
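The whole iterative procedure (compute the VIFs, drop the worst offender, recompute) can be wrapped in a short loop. Here is a minimal sketch under the same assumptions as before, namely a numeric pandas DataFrame df of predictors (a hypothetical name) and a cutoff of 5:

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(predictors, threshold=5.0):
    """Repeatedly drop the predictor with the highest VIF until all VIFs are below the threshold."""
    cols = list(predictors.columns)
    while len(cols) > 1:
        X = sm.add_constant(predictors[cols]).values
        # VIF for each actual predictor (index 0 is the added constant column).
        vifs = {c: variance_inflation_factor(X, i + 1) for i, c in enumerate(cols)}
        worst, worst_vif = max(vifs.items(), key=lambda kv: kv[1])
        if worst_vif <= threshold:
            break
        cols.remove(worst)   # e.g. "Age" on the first pass in our example
    return cols

# kept = drop_high_vif(df)   # df: the numeric predictor DataFrame (assumed to exist)
```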


Conclusion

To summarize the article, the entire content can be condensed into the following points-

  • In a multivariate regression model, the variance inflation factor (VIF) measures the multicollinearity among the independent variables.
  • While multicollinearity does not lower the model's explanatory power, it does reduce the statistical significance of the independent variables; therefore, detecting it is crucial.
  • A large VIF on an independent variable denotes a strongly collinear relationship with the other variables, which should be taken into account when structuring the model and selecting the independent variables.

In this article, we discussed one of the most important foundational concepts in applied statistics. The VIF metric is available as a built-in function in most data-science-oriented programming languages, such as Python and R, so it is easy to implement once you understand the theoretical intuition. I have added links to some more advanced material in the references section, where you can dive deep into the complex calculations if you are interested.

References


Written by sanjaykn170396 | Data scientist | ML Engineer | Statistician
Published by HackerNoon on 2023/02/03