Multicollinearity is a well-known challenge in multiple regression. The term refers to high correlation between two or more explanatory variables, i.e. predictors. It can be an issue in machine learning, but what really matters is your specific use case.

According to Graham’s study, multicollinearity in multiple regression leads to inaccurate parameter estimates, decreased power, and exclusion of significant predictors.

As you can see, all of these relate to variable importance. In many cases, multiple regression is used with the purpose of understanding something. For instance, an ecologist might want to know what kind of environmental and biological factors lead to changes in the population size of chimpanzees. If understanding is the goal, checking for correlation between predictors is standard practice.

They say that a picture is worth a thousand words, so let’s see some multicollinearity. Imagine you want to understand what drives fuel consumption in cars. We can use a dataset called mtcars that has information on miles per gallon (mpg) and other details about different models of cars. When we plot correlations of all the continuous variables, we can see a lot of strong correlations, i.e. multicollinearity.

When I ran a basic linear regression on this data, the car’s weight (wt) showed up as a statistically significant predictor. We also see that other variables strongly correlate with our target (mpg). Still, the model didn’t recognize them as important predictors due to multicollinearity.

When we think about regression, we need to make an assumption about the distribution, i.e. the shape of the data. A method that requires us to specify a distribution is a parametric method. For example, the Poisson distribution could represent weekly counts of infection cases. On the other hand, non-parametric methods work with unspecified distributions, which is often the case for machine learning algorithms.

Multicollinearity and Machine Learning

Machine learning (ML) algorithms usually aim to achieve the best accuracy or a low prediction error, not to explain the true relationship between predictors and the target variable. This can lead you to believe that multicollinearity is not an issue in machine learning. In fact, when searching for “multicollinearity and machine learning”, one of the top search results was a story titled “Why multicollinearity isn’t an issue in machine learning”.

I wouldn’t be so blunt. Multicollinearity can be an issue in machine learning. For example, Veaux and Ungar compare two nonlinear and non-parametric approaches - a feedforward neural network and multivariate adaptive regression splines (MARS). MARS has an advantage over neural networks because it retains predictive performance and offers explainability. On the other hand, MARS struggles with multicollinearity, while neural networks with their redundant architecture don’t.

Well, let’s all just use neural networks for everything, right?

Multicollinearity and ML Explainability

The problem is that in practice, you need to explain your system’s behaviour, especially if it makes decisions. ML Explainability is important so that intelligent technologies don’t inherit societal biases. Also, suppose your users live within the European Union, and you use automated decision making, e.g. declining an online credit application. In that case, you need to be able to explain how you got the result.

But what if you really don’t care about explaining and understanding anything? You just want a nice black box that will have an outstanding performance. If that’s the case, you are off the hook; you can ignore correlated predictors, but then don’t go checking variable importance when someone asks you about it. For those interested in handling correlated features, here are some tips.
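Before handling correlated features, it helps to measure them. The correlation check from the car example can be sketched roughly as below. This is a minimal illustration, not the analysis behind the article: the numbers are hypothetical toy values that only borrow the mtcars column names (mpg, wt, disp, hp), and the helper names `pearson` and `correlation_matrix` are made up for this sketch.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise correlations for a dict that maps column name -> values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical toy values, NOT the real mtcars rows.
cars = {
    "mpg": [30.0, 27.0, 24.0, 21.0, 18.0, 16.0, 14.0, 12.0],
    "wt": [1.8, 2.1, 2.5, 2.8, 3.3, 3.5, 3.9, 4.4],
    "disp": [80.0, 110.0, 140.0, 170.0, 250.0, 300.0, 360.0, 420.0],
    "hp": [65.0, 90.0, 95.0, 110.0, 150.0, 180.0, 215.0, 245.0],
}

corr = correlation_matrix(cars)

# Print one row per variable; large absolute values between predictors
# (here wt, disp and hp) are the sign of multicollinearity.
for name, row in corr.items():
    print(f"{name:>4}", " ".join(f"{row[other]:+.2f}" for other in corr))
```

In this toy data the predictors correlate strongly with the target and with each other, which is the pattern described for the car example.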
Dealing with Multicollinearity

You can deal with multicollinearity by dropping highly correlated variables or by extracting new features with Principal Component Analysis (PCA).

PCA is a pre-processing method where your predictors are transformed into orthogonal vectors. In human words, PCA creates new predictors that explain the variance in the observations better than the original ones. These new predictors are combinations of the original data, and we call them principal components.

Both of the solutions will limit you differently. As Good and Hardin point out, removing correlated variables can lead to a loss of statistical power, i.e. the model can miss significant predictors, as with the car example from earlier. PCA makes the model difficult, though not impossible, to interpret. Unfortunately, there is no ‘silver bullet’ solution, and often it won’t be possible to disentangle your explanatory variables completely.

Conclusion

Correlated predictive variables can be an issue in machine learning and non-parametric methods. Multicollinearity is not your friend, so you should always double-check whether your chosen method can handle it automatically. Still, the solution will depend on your use case; mainly, whether you need to explain what the model has learnt or not. For a deep dive on the topic, I recommend reading Graham’s study.
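As a closing illustration, the two options from “Dealing with Multicollinearity” can be sketched in plain Python. Everything here is an assumption made for the example: the data are hypothetical toy values using mtcars-style column names, the 0.9 threshold is arbitrary, `drop_correlated` is a simple greedy filter rather than a standard routine, and `two_variable_pca` only covers the special case of two standardized predictors, where the principal directions are known in closed form. A real project would more likely use a library implementation.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.9):
    """Greedy filter: keep a predictor only if its absolute correlation
    with every predictor kept so far is below the threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[other])) < threshold for other in kept):
            kept.append(name)
    return kept


def standardize(x):
    """Centre to mean 0 and scale to sample standard deviation 1."""
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
    return [(v - mean) / sd for v in x]


def two_variable_pca(x, y):
    """PCA for two standardized variables.

    The 2x2 correlation matrix [[1, r], [r, 1]] has eigenvectors
    (1, 1)/sqrt(2) and (1, -1)/sqrt(2) with eigenvalues 1 + r and 1 - r,
    so the components are the scaled sum and difference of the z-scores.
    """
    zx, zy = standardize(x), standardize(y)
    s = math.sqrt(2)
    pc_sum = [(a + b) / s for a, b in zip(zx, zy)]
    pc_diff = [(a - b) / s for a, b in zip(zx, zy)]
    return pc_sum, pc_diff


# Hypothetical toy predictors, NOT the real mtcars rows.
predictors = {
    "wt": [1.8, 2.1, 2.5, 2.8, 3.3, 3.5, 3.9, 4.4],
    "disp": [80.0, 110.0, 140.0, 170.0, 250.0, 300.0, 360.0, 420.0],
    "qsec": [18.0, 16.5, 19.0, 17.0, 18.5, 16.0, 19.5, 17.5],
}

# Option 1: drop predictors that are highly correlated with one already kept.
kept = drop_correlated(predictors, threshold=0.9)
print("kept after dropping:", kept)

# Option 2: replace two correlated predictors with uncorrelated components.
pc_sum, pc_diff = two_variable_pca(predictors["wt"], predictors["disp"])
print("correlation between components:", round(pearson(pc_sum, pc_diff), 6))
```

The greedy filter keeps whichever correlated predictor it sees first, so the column order decides what is dropped. That is one concrete way the power loss mentioned above can creep in. The PCA components are uncorrelated by construction, but each one mixes wt and disp, which is why interpretation gets harder.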

Multicollinearity and Its Importance in Machine Learning
