Have you ever wondered how machine learning models are constructed? Today we will explore this and learn some quick techniques to find out which variables influence a model's results and by how much. We will use the FIFA 2018 dataset on Kaggle and explore the following models:

- Decision Tree model
- Random Forest model

This will be the agenda for today:

1. Use the FIFA dataset to train the decision tree model
2. Use the FIFA dataset to train the random forest model
3. Explore the influential variables in the models
4. Find the threshold of the influential variables

So without further ado, let's get started.

1. Use the FIFA dataset to train the decision tree model

Let us first talk a bit about decision trees. Decision tree algorithms start with a root node built from a data sample, select features based on metrics like Gini impurity or information gain, and split nodes into leaf/end nodes until no further split is possible. This is illustrated in the diagram below with a sample tree.

After the data and the libraries have been imported, the following lines of code will train the decision tree model.

```python
# Create the dependent variable
y = (data['Man of the Match'] == "Yes")

# Create the independent variables
feature_names = [i for i in data.columns if data[i].dtype in [np.int64]]
x = data[feature_names]
train_x, test_x, train_y, test_y = train_test_split(x, y, random_state=1)

# Train the decision tree model
dt_model = DecisionTreeClassifier(random_state=0, max_depth=5, min_samples_split=5).fit(train_x, train_y)
pred_y = dt_model.predict(test_x)
cm = confusion_matrix(test_y, pred_y)
print(cm)
accuracy_score(test_y, pred_y)
```

We will get the following output from the confusion matrix:

```
[[ 9  7]
 [ 6 10]]
0.59375
```

The accuracy of the decision tree model is moderate at 59.38%, with (9+10) targets predicted correctly and (7+6) being false positives and false negatives respectively.
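The Gini impurity mentioned above is easy to compute by hand. Here is a minimal, standalone sketch (not part of the tutorial code) of the metric a decision tree uses to score candidate splits:

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 - sum(p_k^2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# A pure node has impurity 0; a 50/50 node has the two-class maximum of 0.5.
print(gini_impurity([1, 1, 1, 1]))  # → 0.0
print(gini_impurity([0, 0, 1, 1]))  # → 0.5
```

The tree picks the split whose child nodes have the lowest weighted impurity, which is why purer splits appear closer to the root.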
2. Use the FIFA dataset to train the random forest model

Let us now learn a bit about the random forest model and then train the data with it. Random forest is an ensemble learning algorithm that constructs multiple decision trees and outputs the class that is either the mode or the mean prediction of the individual decision trees. An illustration is given below:

We will now use the code below to train the random forest model.

```python
# Train the RF model
rf_model = RandomForestClassifier(n_estimators=100, random_state=1).fit(train_x, train_y)
pred_y = rf_model.predict(test_x)
cm = confusion_matrix(test_y, pred_y)
print(cm)
accuracy_score(test_y, pred_y)
```

The output of the random forest model is given below:

```
[[10  6]
 [ 3 13]]
0.71875
```

The random forest model has a better accuracy at 71.88%, with (10+13) targets identified correctly and (6+3) targets misclassified: 6 being false positives and 3 being false negatives.

3. Explore the influential variables in the models

We will now look at the most influential variables in both models and how they affect the accuracy. We will use 'PermutationImportance' from the 'eli5' library for this purpose. We can do this with just a few lines of code, as given below.

```python
# Import PermutationImportance from the eli5 library
import eli5
from eli5.sklearn import PermutationImportance

# Influential variables for the Decision Tree model
perm = PermutationImportance(dt_model, random_state=1).fit(test_x, test_y)
eli5.show_weights(perm, feature_names=test_x.columns.tolist())
```

The influential variables in the decision tree model are:

The most influential variables in the decision tree model are 'Goal scored', 'On-target', 'Distance Covered (Kms)' and 'Off-Target'. There are also variables that influence the accuracy negatively, like 'Attempts' and 'Corners'; hence we can drop these variables from the model to increase the accuracy. Some variables, like 'Red' and 'Ball Possession %', have no influence on the accuracy of the model.
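Under the hood, permutation importance is simple: shuffle one column of the test data at a time and measure how much the accuracy drops. Below is a minimal from-scratch sketch of that idea on synthetic data (the dataset and features here are stand-ins, not the FIFA columns or the eli5 implementation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: with shuffle=False, columns 0-1 are informative,
# columns 2-3 are pure noise.
X, y = make_classification(n_samples=400, n_features=4, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=1)
train_x, test_x, train_y, test_y = train_test_split(X, y, random_state=1)
model = RandomForestClassifier(n_estimators=100, random_state=1).fit(train_x, train_y)

baseline = accuracy_score(test_y, model.predict(test_x))
rng = np.random.RandomState(1)
drops = []
for col in range(test_x.shape[1]):
    shuffled = test_x.copy()
    rng.shuffle(shuffled[:, col])  # break the link between this feature and the target
    drops.append(baseline - accuracy_score(test_y, model.predict(shuffled)))
    print(f"feature {col}: accuracy drop {drops[-1]:+.3f}")
```

A large drop means the model relied on that feature; a drop near zero (or negative) means it did not, which is exactly what the eli5 weight table summarises, averaged over several reshuffles.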
The weights indicate by what percentage the model accuracy is impacted when the values of a variable are re-shuffled. For example, the feature 'Goal Scored' improves the model accuracy by 14.37%, within a range of (+/-) 11.59%.

The influential variables in the random forest model are:

As you can observe, there are significant differences in the variables that influence the two models, and even for the same variable, say 'Goal Scored', the percentage change in accuracy differs.

4. Find the threshold of the influential variable at which the changes to model accuracy happen

Let us now take one variable, say 'Distance Covered (Kms)', and try to find the threshold at which the accuracy increases. We can do this easily with partial dependence plots (PDP). A partial dependence plot depicts the functional relationship between input variables and predictions: it shows how the predictions partially depend on the values of the input variables. For example, we can create a partial dependence plot of the variable 'Distance Covered (Kms)' to understand how changes in its values affect the overall accuracy of the model. We will start with the decision tree model first.

```python
# Import the libraries
from matplotlib import pyplot as plt
from pdpbox import pdp, get_dataset, info_plots

# Select the variable/feature to plot
feature_to_plot = 'Distance Covered (Kms)'

# PDP plot for the Decision Tree model
pdp_dist = pdp.pdp_isolate(model=dt_model, dataset=test_x, model_features=feature_names, feature=feature_to_plot)
pdp.pdp_plot(pdp_dist, feature_to_plot)
plt.show()
```

The plot will look like this:

If the distance covered is 102 Km, that influences the model positively, but covering more or less than 102 Km does not influence the model. The partial dependence plot thus provides insight into the threshold values of the features that influence the model.
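What pdp_isolate computes can be sketched by hand: sweep the chosen feature over a grid of values, overwrite that feature for every row, and average the model's predictions at each grid point. Here is a minimal sketch on synthetic data (the model, data and feature index are illustrative stand-ins, not the FIFA setup or the pdpbox internals):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def partial_dependence(model, X, feature_idx, grid):
    """Average predicted probability as one feature is swept over grid values."""
    avg_preds = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = v  # override the feature for every row
        avg_preds.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(avg_preds)

# Fit a small tree on synthetic data and trace the PD curve for feature 0.
X, y = make_classification(n_samples=300, n_features=4, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)
grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 10)
pd_curve = partial_dependence(model, X, 0, grid)
print(pd_curve)
```

Plotting pd_curve against grid gives the same kind of curve as the pdpbox output: flat segments mean the feature has no influence in that range, and steps or slopes reveal the thresholds.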
Now we can use the same code for the random forest model and look at the plot:

For the random forest model the plot looks a bit different: the performance of the model increases as the distance covered goes from 99 Km to about 102 Km, after which the variable has little or no influence on the model, as shown by the declining trend and the flat line that follows.

Summary:

This is how we can use simple PDP plots to understand the behaviour of influential variables in a model. This information not only yields insights about the variables that impact the model but is especially helpful when training models and selecting the right features. The thresholds can also help to create bins that can be used to sub-set the features and further enhance the accuracy of the model.

Please refer to this link on Github for the full code.

Do reach out to me in case of any questions/comments.
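The binning idea from the summary can be sketched with pandas. The 99 to 102 Km window comes from the PDP discussion above, but the exact cut points and labels below are illustrative assumptions, not values from the article's code:

```python
import pandas as pd

# Hypothetical thresholds taken from the PDP analysis: below 99 Km, the
# 99-102 Km window where the model responds, and above 102 Km.
distance = pd.Series([95, 99, 101, 102, 104, 108], name="Distance Covered (Kms)")
bins = pd.cut(distance, bins=[0, 99, 102, float("inf")],
              labels=["low", "threshold", "high"])
print(bins.tolist())  # → ['low', 'low', 'threshold', 'threshold', 'high', 'high']
```

The resulting categorical column can then be one-hot encoded and fed back to the model as an engineered feature.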