paint-brush
Multicollinearity and Its Importance in Machine Learningby@nikolao
11,769 reads
11,769 reads

Multicollinearity and Its Importance in Machine Learning

by Nikola O.January 4th, 2022
Read on Terminal Reader
Read this story w/o Javascript

Too Long; Didn't Read

Multicollinearity is a well-known challenge in multiple regression. The term refers to the high correlation between two or more explanatory variables, i.e. predictors. It can be an issue in machine learning, but what really matters is your specific use case. In many cases, multiple regression is used with the purpose of understanding something. For example, an ecologist might want to know what kind of environmental and biological factors lead to changes in the population size of chimpanzees. We think of machine learning algorithms as black boxes that need to predict, but that black box sometimes needs to be understood as well. That's when multicollinearity is an issue.

Company Mentioned

Mention Thumbnail
featured image - Multicollinearity and Its Importance in Machine Learning
Nikola O. HackerNoon profile picture

Multicollinearity is a well-known challenge in multiple regression. The term refers to the high correlation between two or more explanatory variables, i.e. predictors. It can be an issue in machine learning, but what really matters is your specific use case.

Multicollinearity

According to Graham’s study, multicollinearity in multiple regression leads to:


  • Inaccurate parameter estimates,
  • Decreased power
  • Exclusion of significant predictors


As you can see, all of these relate to variable importance. In many cases, multiple regression is used with the purpose of understanding something. For instance, an ecologist might want to know what kind of environmental and biological factors lead to changes in the population size of chimpanzees. If understanding is the goal, checking for correlation between predictors is standard practice. They say that a picture is better than a thousand words so let’s see some multicollinearity.


Imagine you want to understand what drives fuel consumption in cars. We can use a dataset called mtcars that has information on miles per gallon (mpg) and other details about different models of cars. When we plot correlations of all the continuous variables, we can see a lot of strong correlations, i.e. multicollinearity.


When I ran a simple linear regression on this data, the car’s weight (wt) showed up as a statistically significant predictor. We also see that other variables strongly correlate with our target (mpg). Still, the model didn’t recognize them as important predictors due to multicollinearity.


When we think about regression, we need to make an assumption about the distribution, the shape of the data. When we need to specify a distribution, this is a parametric method. For example, the Poisson distribution could represent weekly counts of infection cases. On the other hand, non-parametric methods work with unspecified distributions, which is often the case for machine learning algorithms.

Multicollinearity and Machine Learning

Machine learning (ML) algorithms usually aim to achieve the best accuracy or low prediction error, not to explain the true relationship between predictors and the target variable. This can let you believe that multicollinearity is not an issue in machine learning. In fact, when searching for “multicollinearity and machine learning”, one of the top search results was a story titled “Why multicollinearity isn’t an issue in machine learning”.


I wouldn’t be so blunt.


Multicollinearity can be an issue in machine learning.


For example, Veaux and Ungar compare two nonlinear and non-parametric approaches - feedforward neural network and multivariate adaptive regression splines (MARS). The second one has an advantage over neural networks because it holds the predictive performance and offers explainability. On the other hand, MARS struggles with multicollinearity while neural networks with their redundant architecture don’t.


Well, let’s all just use neural networks for everything, right?

Multicollinearity and ML Explainability

The problem is that in practice, you need to explain your system’s behaviour, especially if it makes decisions. ML Explainability is important so that intelligent technologies don’t inherit societal biases. Also, suppose your users live within the European Union, and you use automated decision making, e.g. declining online credit application. In that case, you need to be able to explain how you got the result.

But what if you really don’t care about explaining and understanding anything? You just want a nice black box that will have an outstanding performance. If that’s the case, you are off the hook; you can ignore correlated predictors but then don’t check variable importance when someone asks you about it.


For those interested in handling correlated features, here are some tips.

Dealing with Multicollinearity

You can deal with multicollinearity by


  1. dropping highly correlated variables
  2. extracting new features with Principal Component Analysis (PCA)


PCA is a pre-processing method where your predictors are transformed into orthogonal vectors. In human words, PCA creates new predictors that explain the variance in the observations better than the original. These new predictors are combinations of the original data, and we call them principal components.


Screenshot from setosa.io

Both of the solutions will limit you differently. As Good and Hardin point out, removing correlated variables can lead to a loss of statistical power, i.e. the model can miss significant predictors as with the car example from earlier. PCA makes the model difficult to interpret, not impossible though. Unfortunately, there is no ‘silver bullet solution, but often it won’t be possible to disentangle your explanatory variables completely.

Conclusion

Correlated predictive variables can be an issue in machine learning and non-parametric methods. Multicollinearity is not your friend, so you should always double-check whether your chosen method can handle it automatically. Still, the solution will depend on your use case; mainly, whether you need to explain what the model has learnt or not.


For a deep dive on the topic, I recommend reading Graham’s study.