Multicollinearity is a well-known challenge in multiple regression. The term refers to the high correlation between two or more explanatory variables, i.e. predictors. It can be an issue in machine learning, but what really matters is your specific use case.
Multicollinearity
According to
- Inaccurate parameter estimates,
- Decreased power
- Exclusion of significant predictors
As you can see, all of these relate to variable importance. In many cases, multiple regression is used with the purpose of understanding something. For instance, an ecologist might want to know what kind of environmental and biological factors lead to changes in the population size of chimpanzees. If understanding is the goal, checking for correlation between predictors is standard practice. They say that a picture is better than a thousand words so letâs see some multicollinearity.
Imagine you want to understand what drives fuel consumption in cars. We can use a dataset called mtcars that has information on miles per gallon (mpg) and other details about different models of cars. When we plot correlations of all the continuous variables, we can see a lot of strong correlations, i.e. multicollinearity.
When I ran a simple linear regression on this data, the carâs weight (wt) showed up as a statistically significant predictor. We also see that other variables strongly correlate with our target (mpg). Still, the model didnât recognize them as important predictors due to multicollinearity.
When we think about regression, we need to make an assumption about the distribution, the shape of the data. When we need to specify a distribution, this is a parametric method. For example, the
Multicollinearity and Machine Learning
Machine learning (ML) algorithms usually aim to achieve the best accuracy or low prediction error, not to explain the true relationship between predictors and the target variable. This can let you believe that multicollinearity is not an issue in machine learning. In fact, when searching for âmulticollinearity and machine learningâ, one of the top search results was a
I wouldnât be so blunt.
Multicollinearity can be an issue in machine learning.
For example,
Well, letâs all just use neural networks for everything, right?
Multicollinearity and ML Explainability
The problem is that in practice, you need to explain your systemâs behaviour, especially if it makes decisions.
But what if you really donât care about explaining and understanding anything? You just want a nice black box that will have an outstanding performance. If thatâs the case, you are off the hook; you can ignore correlated predictors but then donât check variable importance when someone asks you about it.
For those interested in handling correlated features, here are some tips.
Dealing with Multicollinearity
You can deal with multicollinearity by
- dropping highly correlated variables
- extracting new features with Principal Component Analysis (PCA)
Both of the solutions will limit you differently. As
Conclusion
Correlated predictive variables can be an issue in machine learning and non-parametric methods. Multicollinearity is not your friend, so you should always double-check whether your chosen method can handle it automatically. Still, the solution will depend on your use case; mainly, whether you need to explain what the model has learnt or not.
For a deep dive on the topic, I recommend reading