Have you ever wondered how machine learning models arrive at their predictions? Today we will explore this and learn some quick techniques for finding out which variables influence a model's results, and by how much.
We will use the FIFA 2018 dataset from Kaggle and explore the following models:
This will be the agenda for today:
So without further ado let's get started.
Let us first talk a bit about decision trees.
A decision tree algorithm starts with a root node built from a data sample, then selects features based on metrics like Gini impurity or information gain and splits nodes into leaf nodes/end nodes until no further split is possible. This is illustrated in the diagram below with a sample tree.
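As a toy illustration of the split criterion (a sketch for intuition, not part of the FIFA workflow), Gini impurity for a node is 1 minus the sum of squared class proportions:

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# A pure node has impurity 0; a perfectly mixed 50/50 node has impurity 0.5
print(gini_impurity(np.array([1, 1, 1, 1])))   # 0.0
print(gini_impurity(np.array([0, 1, 0, 1])))   # 0.5
```

The tree greedily picks the split that reduces this impurity the most at each node.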
After the data and the libraries have been imported, the following lines of code will help to train the decision tree model.
# Create the dependent variable
y = (data['Man of the Match'] == "Yes")
# Create the independent variables (all integer columns)
feature_names = [i for i in data.columns if data[i].dtype in [np.int64]]
x = data[feature_names]
# Split the data and train the decision tree model
train_x, test_x, train_y, test_y = train_test_split(x, y, random_state=1)
dt_model = DecisionTreeClassifier(random_state=0, max_depth=5,
                                  min_samples_split=5).fit(train_x, train_y)
# Evaluate with a confusion matrix and accuracy score
pred_y = dt_model.predict(test_x)
cm = confusion_matrix(test_y, pred_y)
print(cm)
accuracy_score(test_y, pred_y)
We get the following confusion matrix and accuracy score as output:
[[ 9  7]
 [ 6 10]]
0.59375
The accuracy of the decision tree model is moderate at 59.38%, with (10+9) targets predicted correctly, 7 false positives, and 6 false negatives.
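As a quick sanity check, the accuracy follows directly from the confusion matrix entries (correct predictions divided by all predictions):

```python
# Confusion matrix from the decision tree: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 9, 7, 6, 10
accuracy = (tn + tp) / (tn + fp + fn + tp)
print(accuracy)  # 0.59375
```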
Let us now learn a bit about the random forest model and then train the data with it.
Random forest is an ensemble learning algorithm that works by constructing multiple decision trees and outputting the mode of the individual trees' predicted classes (for classification) or their mean prediction (for regression).
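The aggregation step can be sketched in a few lines (the per-tree predictions below are made up for illustration):

```python
import numpy as np

# Hypothetical class predictions from five individual trees for one sample
tree_predictions = np.array([1, 0, 1, 1, 0])

# For classification, the forest outputs the mode (majority vote)
values, counts = np.unique(tree_predictions, return_counts=True)
majority = values[np.argmax(counts)]
print(majority)  # 1
```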
An illustration is given below:
We will now use the code below to train the random forest model.
# Train the RF model
rf_model = RandomForestClassifier(n_estimators=100, random_state=1).fit(train_x,train_y)
pred_y = rf_model.predict(test_x)
cm = confusion_matrix(test_y,pred_y)
print(cm)
accuracy_score(test_y,pred_y)
The output of the random forest model is given below:
[[10  6]
 [ 3 13]]
0.71875
The random forest model has better accuracy at 71.88%, with (10+13) targets identified correctly and (6+3) misclassified: 6 false positives and 3 false negatives.
We will now look at the most influential variables in both models and how they affect the accuracy. We will use 'PermutationImportance' from the 'eli5' library for this purpose. This takes only a few lines of code, as given below.
# Import PermutationImportance from the eli5 library
import eli5
from eli5.sklearn import PermutationImportance
# Fit PermutationImportance on the test set, then display the
# influential variables for the decision tree model
perm = PermutationImportance(dt_model, random_state=1).fit(test_x, test_y)
eli5.show_weights(perm, feature_names=test_x.columns.tolist())
The influential variables in the decision tree model are:
The most influential variables in the decision tree model are 'Goal Scored', 'On-Target', 'Distance Covered (Kms)' and 'Off-Target'. There are also variables that influence the accuracy negatively, like 'Attempts' and 'Corners', so we can drop these variables from the model to increase the accuracy. Some variables, like 'Red' and 'Ball Possession %', have no influence on the accuracy of the model.
The weights indicate by what percentage the model's accuracy is impacted when the values of that variable are re-shuffled. For example, the feature 'Goal Scored' improves the model accuracy by 14.37%, with a spread of ±11.59%.
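To see what the re-shuffling means in practice, here is a minimal hand-rolled sketch of permutation importance on synthetic data (the dataset and model here are stand-ins, not the FIFA data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 4 features, 2 of them informative
X, y = make_classification(n_samples=400, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

baseline = accuracy_score(y_te, model.predict(X_te))
rng = np.random.default_rng(0)
for col in range(X_te.shape[1]):
    X_perm = X_te.copy()
    rng.shuffle(X_perm[:, col])        # break this column's link to the target
    drop = baseline - accuracy_score(y_te, model.predict(X_perm))
    print(f"feature {col}: accuracy drop when shuffled = {drop:.3f}")
```

A large accuracy drop means the model relies heavily on that column; eli5 repeats the shuffle several times and reports the mean drop plus a spread.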
The influential variables in the random forest model (obtained by fitting PermutationImportance on rf_model in the same way) are:
As you can observe, there are significant differences between the variables that influence the two models, and for the same variable, say 'Goal Scored', the percentage change in accuracy also differs.
Let us now take one variable, say 'Distance Covered (Kms)', and try to find the threshold at which the prediction changes. We can do this easily with partial dependence plots (PDPs).
A partial dependence (PD) plot depicts the functional relationship between input variables and predictions. It shows how the predictions partially depend on values of the input variables.
For example, we can create a partial dependence plot of the variable 'Distance Covered (Kms)' to understand how changes in its values affect the model's predictions.
We will start with the decision tree model.
# Import the libraries
from matplotlib import pyplot as plt
from pdpbox import pdp
# Select the variable/feature to plot
feature_to_plot = 'Distance Covered (Kms)'
# PDP for the decision tree model
pdp_dist = pdp.pdp_isolate(model=dt_model, dataset=test_x,
                           model_features=feature_names,
                           feature=feature_to_plot)
pdp.pdp_plot(pdp_dist, feature_to_plot)
plt.show()
The plot will look like this:
If the distance covered is around 102 km, it influences the prediction positively, but covering more or less than 102 km does not influence the model.
The PDP (Partial dependence plot) helps to provide an insight into the threshold values of the features that influence the model.
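Conceptually, the plot is built by sweeping the chosen feature over a grid of values while leaving the other columns untouched, and averaging the model's predictions at each grid point. A minimal sketch on synthetic data (not the FIFA dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and model for illustration
X, y = make_classification(n_samples=300, n_features=3, n_informative=2,
                           n_redundant=0, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

feature = 0                                # index of the feature to isolate
grid = np.linspace(X[:, feature].min(), X[:, feature].max(), 10)

partial_dependence = []
for value in grid:
    X_mod = X.copy()
    X_mod[:, feature] = value              # force every row to this value
    # average predicted probability of the positive class over all rows
    partial_dependence.append(model.predict_proba(X_mod)[:, 1].mean())

print(np.round(partial_dependence, 3))
```

Plotting `partial_dependence` against `grid` gives the same kind of curve that pdpbox draws, and the point where the curve rises is the threshold discussed above.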
Now we can use the same code for the random forest model (passing rf_model instead of dt_model) and look at the plot:
For the random forest model the plot looks a bit different: the predicted outcome increases as the distance covered goes from 99 km to about 102 km, after which the variable has little or no influence on the model, as shown by the declining trend and the flat line thereafter.
This is how we can use simple PDP plots to understand the behaviour of influential variables in a model. This information not only gives insight into the variables that impact the model, but is especially helpful when training models and selecting the right features. The thresholds can also be used to create bins for sub-setting features, which can further enhance the accuracy of the model.
Please refer to this link on GitHub for the full code.
Do reach out to me in case of any questions/comments.