Bolstering an organization’s productivity is probably the first application of Machine Learning (ML) in business that comes to mind. Any company switching to a more data-driven approach would first and foremost use ML to produce more or, more accurately, to produce more efficiently. Evaluating and quantifying the end result is fairly straightforward. However, there has been increasing interest in applying ML within the human resources department. Employee (or candidate) data can be a really powerful tool in the hands of HR practitioners, helping them accomplish their tasks faster and, in some cases (such as hiring), in a less biased manner.
Integrating ML into HR processes is not a distant-future scenario. Many enterprises have already embraced it, and the trends show that it will soon become the norm. Most of the use cases fall into three major groups:
Hiring the best available talent for a job is a two-way process. Certain Natural Language Processing tools can be a useful aid in composing a job description that will sound attractive and interesting to the candidate. On the other hand, ML can make the HR practitioner’s life easier when going through hundreds (or even thousands) of resumes, filtering out those that are not a great match for the role.
Having an algorithm do this can greatly reduce the unavoidable human bias as well as the amount of time required. An interesting case study is Unilever. They switched to a more data-driven hiring model with the help of HireVue, and this led to £1M+ annual cost savings, 96% candidate completion rates, a 90% reduction in time to hire, and a 16% increase in new-hire diversity.
An ML-based system can process data collected via various intra-company sources and provide insights on how engaged the employees are in their work. These findings are easily missed by the human eye, especially for very big companies.
There are studies demonstrating how important development opportunities are for employee happiness. In some age groups (such as millennials), a robust, personalized framework for growth has been shown to matter more than salary. The employer benefits too: it is in their interest to keep their staff up to date with the latest trends and technologies. But blindly spending money on training employees is inefficient. Two steps are necessary here:
a) Identify the skill gap on an individual level and
b) Create a learning path for them.
“Individual” is a key word. The development plan must be as personalized as possible; there is no one-size-fits-all solution. This is where ML can help, by analyzing the employee’s data and highlighting the missing or weak skills, then generating a plan of action and monitoring the progress made. As an example, you can read the case study of Amadeus, a travel technology company, which designed and built the Valamis-based Amadeus Learning Universe (ALU) global learning platform in order to train Amadeus employees, partners, and clients worldwide.
You invested a lot of time and money in finding and attracting a great candidate, and they are performing as expected. You want to make sure that they are happy and committed to the company. Losing great talent is not only frustrating; such employees can also be really expensive to replace, especially in senior or managerial positions. There is a series of costs involved, ranging from finding new people to onboarding and training them.
ML algorithms can provide a clearer picture of the most important factors behind past employee departures, as well as estimate the likelihood of a specific individual leaving in the future, based on historical data. In this article, I will focus on job attrition and demonstrate how we can use ML to help analyze the data and make predictions. The PyCaret library will be used to speed up the process.
Gartner defined the citizen data scientist as a person:
“Who creates or generates models that use advanced diagnostic analytics or predictive and prescriptive capabilities, but whose primary job function is outside the field of statistics and analytics.”
PyCaret is a library that is inspired by the emerging role of citizen data scientists. It is the Python version of the Caret package for R, essentially a wrapper around some of the most popular machine learning libraries, such as scikit-learn, XGBoost, and many others. It greatly simplifies the lifecycle of an ML experiment from model training to deployment. Behind the scenes, PyCaret takes care of a great amount of the necessary coding, exposing easy-to-use functions to the user. This leads to a great reduction of the lines of code that actually need to be written.
From the documentation page:
PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. It is an end-to-end machine learning and model management tool that speeds up the experiment cycle exponentially and makes you more productive.
I will be using the IBM HR Analytics Employee Attrition & Performance dataset available from Kaggle, which contains a mixture of numeric and categorical variables. The objective is to predict if an employee will leave the company (attrition). I will use Google Colab to train the model and perform the analysis. For this specific example, the dataset is assumed to be saved at this location:
/content/drive/MyDrive/Data/HR-IBM_ATTRITION/WA_Fn-UseC_-HR-Employee-Attrition.csv
This is a typical binary classification task; the target variable (Attrition) takes Yes/No values, which PyCaret encodes internally as 1/0.
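For intuition, the target encoding PyCaret performs is conceptually equivalent to this small pandas sketch (toy values, not the actual dataset):

```python
import pandas as pd

# Toy example: encode a Yes/No target column as 1/0,
# mirroring what PyCaret does internally for a binary target.
attrition = pd.Series(["Yes", "No", "No", "Yes"])
encoded = attrition.map({"Yes": 1, "No": 0})
print(encoded.tolist())  # [1, 0, 0, 1]
```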
!pip install pycaret shap
from pycaret.classification import *
import pandas as pd
from pycaret.utils import enable_colab
enable_colab()
df_ibm = pd.read_csv('/content/drive/MyDrive/Data/HR-IBM_ATTRITION/WA_Fn-UseC_-HR-Employee-Attrition.csv')
df_ibm.info()
After performing EDA, the major insights were:
So we will start by dropping the columns that don’t offer anything to the model:
df_ibm.drop(['Over18','EmployeeCount','StandardHours','EmployeeNumber'],axis=1,inplace=True)
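Columns like Over18, EmployeeCount, and StandardHours hold a single constant value in this dataset. One quick way to spot such zero-variance columns programmatically is to count unique values; here is an illustrative sketch on a toy frame (not the real data):

```python
import pandas as pd

df = pd.DataFrame({
    "Over18": ["Y", "Y", "Y"],        # constant -> useless for the model
    "StandardHours": [80, 80, 80],    # constant -> useless for the model
    "Age": [29, 41, 35],              # varies -> keep
})

# Columns with a single unique value carry no information.
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
print(constant_cols)  # ['Over18', 'StandardHours']
```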
Before running any model training, the environment must be set up. This is super easy: all we need is to call the setup() function. But first, we need to take care of how the variables will be interpreted. PyCaret infers the various variable types (numeric, categorical, etc.), and we can override this behavior by providing our own mapping of variables to data types. Let’s create these lists and dictionaries first and have them ready to be passed as arguments to setup().
cat_vars = ['Department','EducationField','Gender','JobRole','MaritalStatus','OverTime',]
ordinal_vars = {
'BusinessTravel' : ['Non-Travel','Travel_Rarely','Travel_Frequently'],
'Education' : [str(x) for x in sorted(df_ibm['Education'].unique())],
'EnvironmentSatisfaction' : [str(x) for x in sorted(df_ibm['EnvironmentSatisfaction'].unique())],
'JobInvolvement' : [str(x) for x in sorted(df_ibm['JobInvolvement'].unique())],
'JobLevel' : [str(x) for x in sorted(df_ibm['JobLevel'].unique())],
'JobSatisfaction' : [str(x) for x in sorted(df_ibm['JobSatisfaction'].unique())],
'PerformanceRating' : [str(x) for x in sorted(df_ibm['PerformanceRating'].unique())],
'RelationshipSatisfaction' : [str(x) for x in sorted(df_ibm['RelationshipSatisfaction'].unique())],
'StockOptionLevel' : [str(x) for x in sorted(df_ibm['StockOptionLevel'].unique())],
'TrainingTimesLastYear' : [str(x) for x in sorted(df_ibm['TrainingTimesLastYear'].unique())],
'WorkLifeBalance' : [str(x) for x in sorted(df_ibm['WorkLifeBalance'].unique())]
}
numeric_features = ['DailyRate','DistanceFromHome','Age','HourlyRate','MonthlyIncome','MonthlyRate','NumCompaniesWorked','PercentSalaryHike','TotalWorkingYears','YearsAtCompany','YearsInCurrentRole','YearsSinceLastPromotion','YearsWithCurrManager']
We are now ready to set up the environment.
experiment = setup(df_ibm, target='Attrition',
                   categorical_features=cat_vars,
                   train_size=0.8,
                   ordinal_features=ordinal_vars,
                   remove_multicollinearity=True,
                   multicollinearity_threshold=0.9,
                   transformation=True,
                   normalize=True,
                   numeric_features=numeric_features,
                   session_id=42)
Let me explain some of the parameters of this setup() function:
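Two of these options are easy to picture outside PyCaret: normalize=True z-scores the numeric features, and remove_multicollinearity drops one feature of each pair whose correlation exceeds the threshold. Here is a rough sketch of both ideas in plain pandas on toy data (not PyCaret’s exact implementation):

```python
import pandas as pd

df = pd.DataFrame({
    "MonthlyIncome": [3000.0, 5000.0, 9000.0, 12000.0],
    "MonthlyRate":   [3100.0, 5100.0, 9100.0, 12100.0],  # perfectly correlated with income
    "Age":           [40.0, 22.0, 51.0, 28.0],
})

# normalize=True: z-score each numeric column (zero mean, unit std).
normalized = (df - df.mean()) / df.std()

# remove_multicollinearity: drop one column of each pair whose
# absolute correlation exceeds the threshold.
threshold = 0.9
corr = df.corr().abs()
to_drop = set()
cols = list(df.columns)
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        if corr.loc[a, b] > threshold and a not in to_drop and b not in to_drop:
            to_drop.add(b)  # keep the first column, drop the second
print(sorted(to_drop))  # ['MonthlyRate']
```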
To get a list of all the available classification models, we can simply run:
models()
which returns:
We can conveniently see how all the above models perform on the training data:
top3 = compare_models(n_select=3,sort='AUC')
This function trains and evaluates the performance of all estimators available in the model library using cross-validation (10-fold by default). The output is a score grid with the average cross-validated scores. In this example, I chose AUC as the metric the models are sorted by. The top3 variable is now a list holding the 3 best-performing models, and they can be referenced like normal Python list elements. Can you imagine? With a single line of code, we trained all the models and generated a metrics grid. Nice.
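Conceptually, compare_models is similar to looping candidate estimators through scikit-learn cross-validation and ranking them by the chosen metric. A simplified sketch on synthetic data (this is an analogy, not PyCaret’s actual code):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=42)

candidates = {
    "lr": LogisticRegression(max_iter=1000),
    "dt": DecisionTreeClassifier(random_state=42),
    "rf": RandomForestClassifier(random_state=42),
}

# Score every candidate with 10-fold cross-validated AUC, then rank descending.
scores = {
    name: cross_val_score(model, X, y, cv=10, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
top3 = [candidates[name] for name in ranking[:3]]
print(ranking)
```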
It is worth mentioning here that PyCaret makes individual model creation very easy, as well as various ensembling techniques (boosting, bagging, blending, stacking). Check the documentation on how to perform these steps. Just to give you an idea, we could blend the top3 models like this:
blend = blend_models(top3)
But for now, we will use the top model, Logistic Regression. The next step would be to try and fine-tune the model. PyCaret has us covered.
tuned_lr = tune_model(top3[0], n_iter=70, optimize='AUC')
This function tunes the hyperparameters of a given estimator. The output is a score grid with CV scores by fold of the best model selected based on the optimize parameter.
The n_iter parameter is the number of parameter combinations sampled in the random grid search. Increasing n_iter may improve model performance but also increases the training time.
By default, tune_model performs a scikit-learn random grid search.
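The random grid search corresponds to scikit-learn’s RandomizedSearchCV, where n_iter controls how many random parameter combinations are sampled. Roughly (illustrative grid and data, not PyCaret’s internal search space):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Sample random combinations from the grid and keep the best by AUC,
# analogous to tune_model(..., n_iter=..., optimize='AUC').
param_distributions = {
    "C": [0.01, 0.1, 1.0, 10.0, 100.0],
}
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions,
    n_iter=5,
    scoring="roc_auc",
    cv=5,
    random_state=42,
)
search.fit(X, y)
print(search.best_params_)
```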
We are now ready to evaluate the model’s performance on the holdout set (remember, PyCaret created one when we ran the setup). This is again very easy:
pred_holdout = predict_model(tuned_lr)
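The metrics PyCaret reports on the holdout set are the standard scikit-learn ones. For instance, the AUC for a set of predicted probabilities is simply (tiny illustrative labels and scores):

```python
from sklearn.metrics import roc_auc_score

# True labels of a (tiny, illustrative) holdout set and the
# model's predicted probabilities for the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_score(y_true, y_score))  # 0.75
```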
PyCaret takes the burden off us to create the plots that would visually depict the model’s performance. Let’s see some examples:
plot_model(tuned_lr, 'confusion_matrix')
plot_model(tuned_lr,'auc',save=True)
plot_model(tuned_lr,'learning',save=True)
Before saving the model, one last useful step is to run the finalize_model() function. Once the predictions have been generated on the hold-out set using predict_model() and we have chosen to deploy this specific model, we want to train the model one final time on the entire dataset, including the hold-out.
final_tuned_lr = finalize_model(tuned_lr)
To save the final model, we need to run:
save_model(final_tuned_lr, 'final_tuned_lr')
This saves the entire transformation pipeline and trained model object as a transferable binary pickle file for later use.
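Under the hood this is a standard pickle round trip of a fitted pipeline; a plain-Python illustration of the same idea (toy pipeline, not PyCaret’s actual object):

```python
import pickle

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

# Fit a small preprocessing + model pipeline, then serialize and restore it.
pipe = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)
blob = pickle.dumps(pipe)
restored = pickle.loads(blob)
print(restored.predict([[2.5]]))  # same predictions as the original pipeline
```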
As you can imagine, loading the saved model is as easy as:
final_tuned_lr_loaded = load_model('final_tuned_lr')
There is another feature I would like to show you: model interpretation. The implementation is based on SHAP (SHapley Additive exPlanations). SHAP values are a widely used approach from cooperative game theory that comes with desirable properties. They help us demystify how the model works and how each feature affects the final prediction. However, only tree-based models for binary classification are supported (Extra Trees Classifier, Decision Tree Classifier, Random Forest Classifier, and Light Gradient Boosting Machine). Let’s quickly demonstrate how straightforward it is to interpret a model with PyCaret. We will train and tune a Random Forest Classifier first. Instead of using the compare_models() function, we can create individual models like so:
rf = create_model('rf')
and then tune it:
tuned_rf = tune_model(rf, optimize='AUC')
Next, we will generate the SHAP values for each feature:
interpret_model(tuned_rf,save=True)
Now, how do we read this plot? The logic is that:
Looking at the graph, we can conclude that:
Features that have a great effect on attrition
Features that have a medium effect on attrition
I encourage you to check the PyCaret documentation page and try some of their tutorials. I have only covered a small fraction of what the library can do here.