Random forest is one of the most popular algorithms for multiple machine learning tasks. This story looks into random forest regression in R, focusing on understanding the output and variable importance.
If you prefer Python code, here you go.
When decision trees came onto the scene in 1984, they were better than classic multiple regression. One of the reasons is that decision trees are easy on the eyes: people without a degree in statistics can easily interpret the results in the form of branches.
Additionally, decision trees help you avoid the synergy effects of interdependent predictors in multiple regression. A synergy (interaction/moderation) effect is when the effect of one predictor on the outcome depends on the level of another predictor. Here is a nice example from a business context.
On the other hand, regression trees are not very stable: a slight change in the training set can produce a great change in the structure of the whole tree. This high variance makes a single tree, simply put, not very accurate.
But what if you combine multiple trees?
Randomly created decision trees make up a random forest, a type of ensemble modeling based on bootstrap aggregating, i.e. bagging. First, you create various decision trees on bootstrapped versions of your dataset, i.e. random samples drawn with replacement (see the image below). Next, you aggregate (e.g. average) the individual predictions over the decision trees into the final random forest prediction.
Notice that we skipped some observations, namely Istanbul, Paris and Barcelona. These observations, i.e. rows, are called out-of-bag and used for prediction error estimation.
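The bootstrapping step can be sketched in a few lines of base R. This is an illustrative sketch, not part of randomForest itself (the package does this internally per tree); the car row names of mtcars stand in for the cities in the image:

```r
# Sketch: one bootstrap sample of mtcars and its out-of-bag rows
data(mtcars)
set.seed(1)

n <- nrow(mtcars)
boot_idx <- sample(n, size = n, replace = TRUE)  # sampling with replacement
out_of_bag <- setdiff(seq_len(n), boot_idx)      # rows never drawn this round

# On average, roughly a third of the rows end up out-of-bag
length(out_of_bag) / n
rownames(mtcars)[out_of_bag]                     # the "skipped" observations
```

A tree trained on the in-bag rows can then be evaluated on its out-of-bag rows, which is exactly how the prediction error estimate below is obtained.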
Based on CRAN’s list of packages, 63 R libraries mention random forest. I recommend you go over the options as they range from Bayesian-based random forests to clinical and omics-specific libraries. You could potentially find a random forest regression that fits your use case better than the original version. Still, I wouldn’t use it if you can’t find the details of how exactly it improves on Breiman’s and Cutler’s implementation. If you have no idea, it’s safer to go with the original - randomForest.
Code-wise, it’s pretty simple, so I will stick to the example from the documentation using 1974 Motor Trend data.
### Import libraries
library(randomForest)
library(ggplot2)
set.seed(4543)
data(mtcars)
rf.fit <- randomForest(mpg ~ ., data = mtcars, ntree = 1000,
                       keep.forest = FALSE, importance = TRUE)
I will specifically focus on understanding the performance and variable importance. So after we run the piece of code above, we can check out the results by simply running rf.fit.
> rf.fit
Call:
randomForest(formula = mpg ~ ., data = mtcars, ntree = 1000,
keep.forest = FALSE, importance = TRUE)
Type of random forest: regression
Number of trees: 1000
No. of variables tried at each split: 3
Mean of squared residuals: 5.587022
% Var explained: 84.12
Notice that the function ran a random forest regression, and we didn’t need to specify that. It will perform nonlinear multiple regression as long as the target variable is numeric (in this example, it is Miles per Gallon - mpg). There is no switch to flip: the type of forest is inferred from the class of the response, so converting the target to a factor would trigger classification instead.
The mean of squared residuals and % variance explained indicate how well the model fits the data. Residuals are the difference between the predictions and the actual values. Note that 5.6 is in squared units; taking the square root gives a typical error of about 2.4 miles per gallon.
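To see where the printed numbers come from, you can read them directly off the fitted object. A minimal sketch, assuming the randomForest package is installed; for regression, rf.fit$mse stores the out-of-bag MSE after each successive tree, so its last entry matches the printed "Mean of squared residuals":

```r
library(randomForest)
set.seed(4543)
data(mtcars)
rf.fit <- randomForest(mpg ~ ., data = mtcars, ntree = 1000, importance = TRUE)

# OOB MSE after the final (1000th) tree, and the error back on the mpg scale
final_mse  <- rf.fit$mse[rf.fit$ntree]
final_rmse <- sqrt(final_mse)

# rf.fit$rsq holds the pseudo R-squared per tree count; the last entry
# corresponds to the printed "% Var explained"
final_rsq <- rf.fit$rsq[rf.fit$ntree]
```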
You can experiment with, i.e. increase or decrease, the number of trees (ntree) or the number of variables tried at each split (mtry) and see whether the residuals or % variance explained change.
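One way to run that experiment is a small loop over candidate mtry values, comparing the out-of-bag error of each fit. The grid below is arbitrary, chosen only for illustration:

```r
library(randomForest)
set.seed(4543)
data(mtcars)

# Fit one forest per mtry value and record the final out-of-bag MSE
mtry_grid <- c(2, 3, 5, 10)
oob_mse <- sapply(mtry_grid, function(m) {
  fit <- randomForest(mpg ~ ., data = mtcars, ntree = 500, mtry = m)
  fit$mse[fit$ntree]   # OOB MSE after the last tree
})

data.frame(mtry = mtry_grid, oob_mse = round(oob_mse, 2))
</br>
```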
If you also want to understand what the model has learnt, make sure that you set importance = TRUE as in the code above.
Random forest regression in R provides two importance measures: percentage increase in mean squared error (%IncMSE) and increase in node purity (IncNodePurity). %IncMSE is based on permuting out-of-bag sections of the data per individual tree and predictor, recording how much the prediction error grows, and averaging the result over the trees. In the regression context, node purity is the total decrease in residual sum of squares when splitting on a variable, averaged over all trees (i.e. how well a predictor decreases variance). %IncMSE is the more reliable measure of variable importance. If the two importance metrics show different results, listen to %IncMSE. If all of your predictors are numerical, then it shouldn’t be too much of an issue - read more here.
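Before plotting, it can help to look at the raw numbers. A short sketch, assuming the model was fit with importance = TRUE as above; importance() returns a matrix with one column per measure:

```r
library(randomForest)
set.seed(4543)
data(mtcars)
rf.fit <- randomForest(mpg ~ ., data = mtcars, ntree = 1000, importance = TRUE)

# Importance matrix: columns "%IncMSE" and "IncNodePurity"
imp <- importance(rf.fit)

# Sort predictors by %IncMSE, the more reliable of the two measures
imp[order(imp[, "%IncMSE"], decreasing = TRUE), ]
```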
The built-in varImpPlot() will visualize the results, but we can do better. Here, we combine both importance measures into one plot emphasizing the %IncMSE results.
### Visualize variable importance ----------------------------------------------

# Get variable importance from the model fit
ImpData <- as.data.frame(importance(rf.fit))
ImpData$Var.Names <- row.names(ImpData)

ggplot(ImpData, aes(x = Var.Names, y = `%IncMSE`)) +
  geom_segment(aes(x = Var.Names, xend = Var.Names, y = 0, yend = `%IncMSE`),
               color = "skyblue") +
  geom_point(aes(size = IncNodePurity), color = "blue", alpha = 0.6) +
  theme_light() +
  coord_flip() +
  theme(
    legend.position = "bottom",
    panel.grid.major.y = element_blank(),
    panel.border = element_blank(),
    axis.ticks.y = element_blank()
  )
In terms of assessment, it always comes down to some theory or logic behind the data. Do the top predictors make sense? If not, investigate why.
“Rome was not built in one day, nor was any reliable model.”
Modeling is an iterative process. You can get a better idea about the predictive error of your random forest regression when you save some data for performance testing only. You might also want to try out other methods.
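A minimal holdout sketch along those lines, with an arbitrary 75/25 split chosen only for illustration (note that this fit keeps the forest, unlike the keep.forest = FALSE fit above, so it can be used with predict()):

```r
library(randomForest)
set.seed(4543)
data(mtcars)

# Hold out 8 of the 32 rows for testing; the split size is illustrative
test_idx <- sample(nrow(mtcars), size = 8)
train <- mtcars[-test_idx, ]
test  <- mtcars[test_idx, ]

# Fit on the training rows only, then score the held-out rows
fit  <- randomForest(mpg ~ ., data = train, ntree = 1000)
pred <- predict(fit, newdata = test)

# RMSE on data the model never saw during training
test_rmse <- sqrt(mean((pred - test$mpg)^2))
test_rmse
```

Because the out-of-bag error already approximates test error, the two numbers are usually in the same ballpark, but a genuine holdout set is the cleaner check before comparing against other methods.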