A Machine Learning Approach to IBM Employee Attrition and Performance

Predicting the Attrition of Valuable Employees…

An IT firm can organise its engineering teams in several ways. Some firms, or particular departments or levels within them, follow the chief programmer structure: a “star” organisation built around a “chief” position, which is assigned to the Engineer who best understands the system requirements.

Others follow an egoless (democratic) structure, in which all the Engineers are at the same level and are assigned to different jobs such as Front-End Design, Back-End Coding and Software Testing. Very big or multinational software giants do not follow this structure, but all in all it is a very successful and working-environment-friendly one.

Egoless (Democratic) Architecture

The third type is the mixed structure, a combination of the two types above. It is the most widely followed structure and is very common among software giants.

Mixed Controlled Architecture

Likewise, International Business Machines Corporation (IBM) probably follows either the egoless or the mixed structure. For the HR Department, an important responsibility is therefore to measure employee attrition at regular intervals. The factors on which employee attrition depends include:

Age of the Employee
Monthly Income
Overtime
Monthly Rate
Distance from Home
Years at Company

and so on…

IBM has also made an employee dataset publicly available (the Kaggle listing describes it as a fictional dataset created by IBM data scientists), with the problem statement:

“Predict the Attrition of the Employees i.e., will there be attrition of the employees or not, given the Employee Details i.e., the factors responsible for attrition”

The Employee Dataset is made available at Kaggle:

IBM HR Analytics Employee Attrition & Performance: Predict attrition of your valuable employees (www.kaggle.com)

One possible solution is to apply Machine Learning: develop a Predictive Model, train it on the available data, and validate it to analyse the model's performance.

Given below is a step-by-step procedure for developing the Machine Learning model using Python and the Scikit-Learn Machine Learning toolbox:

1. Model Development:

# importing all the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import pylab as pl
from sklearn.metrics import roc_curve, auc

# loading the dataset using Pandas
df = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')
df.head()  # Output shown below

Pandas Dataframe Output of the Dataset

# checking whether the dataset contains any missing values...
# (dropna() removes rows with missing values, so equal shapes mean none were dropped)
df.shape == df.dropna().shape  # Output shown below

Hence, there are no missing values present in the dataset.

This is a Binary Classification Problem, so the distribution of instances between the 2 classes is visualized below:

# counting the instances of each class
y_bar = np.array([df[df['Attrition'] == 'No'].shape[0],
                  df[df['Attrition'] == 'Yes'].shape[0]])
x_bar = ['No (0)', 'Yes (1)']

# Bar Visualization
plt.bar(x_bar, y_bar)
plt.xlabel('Labels/Classes')
plt.ylabel('Number of Instances')
plt.title('Distribution of Labels/Classes in the Dataset')
# Output shown below

Bar Visualization of the Class Distribution

# Label Encoding for Categorical/Non-Numeric Data
# X holds every column except 'Attrition' (column 1), which becomes the label y
X = df.iloc[:, [0] + list(range(2, 35))].values
y = df.iloc[:, 1].values
from sklearn.preprocessing import LabelEncoder

# the same encoder object is re-fitted on each categorical column in turn
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])    # BusinessTravel
X[:, 3] = labelencoder_X_1.fit_transform(X[:, 3])    # Department
X[:, 6] = labelencoder_X_1.fit_transform(X[:, 6])    # EducationField
X[:, 10] = labelencoder_X_1.fit_transform(X[:, 10])  # Gender
X[:, 14] = labelencoder_X_1.fit_transform(X[:, 14])  # JobRole
X[:, 16] = labelencoder_X_1.fit_transform(X[:, 16])  # MaritalStatus
X[:, 20] = labelencoder_X_1.fit_transform(X[:, 20])  # Over18
X[:, 21] = labelencoder_X_1.fit_transform(X[:, 21])  # OverTime
y = labelencoder_X_1.fit_transform(y)                # Attrition
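
To make the encoding step concrete, here is a minimal pure-Python sketch of the mapping a label encoder produces: the distinct values are sorted, and each value is replaced by its position in that sorted list. The helper `label_encode` and the toy column are hypothetical and serve only as an illustration; they are not part of Scikit-Learn or of the dataset.

```python
def label_encode(values):
    """Map each distinct value to an integer, in sorted order of the values."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], classes

# Toy column standing in for df['Attrition'] (illustrative values only)
attrition = ['Yes', 'No', 'No', 'Yes', 'No']
codes, classes = label_encode(attrition)
print(classes)  # ['No', 'Yes']
print(codes)    # [1, 0, 0, 1, 0]
```

Because 'No' sorts before 'Yes', 'No' becomes 0 and 'Yes' becomes 1, which matches the 'No (0)' and 'Yes (1)' labels used for the bar chart.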

# Feature Selection using Random Forest Classifier's Feature Importance Scores
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X, y)  # Output shown below

# ranking the feature indices by importance score, highest first
list_importances = list(model.feature_importances_)
indices = sorted(range(len(list_importances)), key=lambda k: list_importances[k])
feature_selected = list(reversed(indices))
X_selected = X[:, feature_selected[:17]]

# X omits the 'Attrition' column, so the names are looked up in X's own column order
feature_names = df.columns[[0] + list(range(2, 35))]
l_features = np.array([feature_names[x] for x in feature_selected])

# Extracting 17 most important features among 34 features
l_features[:17]  # Output shown below
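
The ranking logic can be written more compactly. The following is a minimal sketch, assuming a list of importance scores and a list of feature names in the same column order as the matrix the forest was fitted on; `top_k_features` and the scores are hypothetical and are not the values from the fitted model.

```python
def top_k_features(importances, names, k):
    """Return the k feature names with the highest importance scores."""
    order = sorted(range(len(importances)),
                   key=lambda i: importances[i], reverse=True)
    return [names[i] for i in order[:k]]

# Hypothetical scores, for illustration only
names = ['Age', 'MonthlyIncome', 'OverTime', 'Gender']
importances = [0.30, 0.40, 0.25, 0.05]
print(top_k_features(importances, names, k=2))  # ['MonthlyIncome', 'Age']
```

The name list has to follow the column order of the feature matrix, not of the original DataFrame, because the label column was removed before fitting.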

# Selecting the 17 most important features, plus the 'Attrition' label as the last column
df_selected = df[['Age', 'MonthlyIncome', 'OverTime', 'EmployeeNumber', 'MonthlyRate',
                  'DistanceFromHome', 'YearsAtCompany', 'TotalWorkingYears', 'DailyRate',
                  'HourlyRate', 'NumCompaniesWorked', 'JobInvolvement', 'PercentSalaryHike',
                  'StockOptionLevel', 'YearsWithCurrManager', 'EnvironmentSatisfaction',
                  'EducationField', 'Attrition']]
df_selected.head()  # Output shown below

The selected categorical features (OverTime and EducationField) and the Attrition label still hold string values, so label encoding has to be done again:

# Label Encoding for selected Non-Numeric Features:
X = df_selected.iloc[:, list(range(0, 17))].values
y = df_selected.iloc[:, 17].values

X[:, 2] = labelencoder_X_1.fit_transform(X[:, 2])    # OverTime
X[:, 16] = labelencoder_X_1.fit_transform(X[:, 16])  # EducationField
y = labelencoder_X_1.fit_transform(y)                # Attrition

Now the Data Pre-Processing steps are over. Let’s move on to Model Training:

# 80-20 splitting where 80% Data is for Training the Model
# and 20% Data is for Validation and Performance Analysis
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1753)
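
As a conceptual illustration of what a seeded hold-out split does, here is a minimal pure-Python sketch. It does not reproduce Scikit-Learn's exact shuffling, and `holdout_split` is a hypothetical helper; the point is only that a fixed seed makes the same rows land in the same partition on every run.

```python
import random

def holdout_split(rows, test_fraction=0.2, seed=0):
    """Shuffle row indices with a fixed seed and hold out a fraction for validation."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(50))  # stand-in for 50 employee records
train, test = holdout_split(rows, test_fraction=0.2, seed=1753)
print(len(train), len(test))  # 40 10
```

Calling the function again with the same seed returns the same partition, which is what `random_state` provides in the call above.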

# Using Logistic Regression Algorithm for Model Training
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(verbose=3)

# Training the Model
clf_trained = clf.fit(X_train, y_train)  # Output shown below

The output names the library that implements the parameter optimization strategy used by Logistic Regression
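
Whichever library does the fitting, a trained binary logistic regression model predicts in the same way: it computes a linear score from the features, passes it through the sigmoid function, and thresholds the resulting probability. The sketch below shows that prediction step in pure Python; the weights, bias and helper names are hypothetical and are not the coefficients learned above.

```python
import math

def predict_proba(features, weights, bias):
    """Probability of the positive class (attrition = 1) for one example."""
    z = bias + sum(w * x for w, x in zip(weights, features))  # linear score
    return 1.0 / (1.0 + math.exp(-z))                         # sigmoid

def predict(features, weights, bias, threshold=0.5):
    """Class label obtained by thresholding the predicted probability."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0

# Hypothetical weights for two features, for illustration only
weights, bias = [0.8, -0.5], -0.2
print(predict([1.0, 0.0], weights, bias))  # positive score -> 1
print(predict([0.0, 1.0], weights, bias))  # negative score -> 0
```

The linear score `z` is the quantity that `decision_function` returns in Scikit-Learn; a score of 0 corresponds to a probability of 0.5.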

2. Model Performance Analysis:

=>Training Accuracy

clf_trained.score(X_train,y_train) # Output shown below

Training Accuracy of 84.44% is achieved by the model

=>Validation Accuracy

clf_trained.score(X_test,y_test) # Output shown below

Validation Accuracy of 89.12% is achieved by the model
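
The validation accuracy can be cross-checked against the counts that appear in the confusion-matrix code later in the article (`[[252, 1], [31, 10]]`). This is a minimal sketch, assuming those counts are the validation-set confusion matrix with true classes as rows and predicted classes as columns (Scikit-Learn's convention). It also computes the accuracy of always predicting the majority class, which is a useful reference point when the classes are imbalanced.

```python
# Counts from the article's confusion matrix; rows are assumed to be true
# classes and columns predicted classes (Scikit-Learn's convention).
cm = [[252, 1],
      [31, 10]]

total = sum(sum(row) for row in cm)
correct = cm[0][0] + cm[1][1]       # diagonal = correctly classified
accuracy = correct / total

# Accuracy of always predicting the larger class (0 = no attrition)
baseline = max(sum(row) for row in cm) / total

print(f"accuracy = {accuracy:.4f}")  # accuracy = 0.8912
print(f"baseline = {baseline:.4f}")  # baseline = 0.8605
```

Under these assumptions the counts reproduce the reported 89.12%, about three percentage points above the majority-class reference.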

=>Precision, Recall and F1-Score

# getting the predictions...
from sklearn.metrics import classification_report
predictions = clf_trained.predict(X_test)

print(classification_report(y_test, predictions))

Classification Report of the model
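
The per-class figures in such a report can be derived by hand from a confusion matrix. Here is a minimal sketch, assuming the counts `[[252, 1], [31, 10]]` that appear in the article's confusion-matrix code are the validation-set matrix with true classes as rows and predicted classes as columns; `class_metrics` is a hypothetical helper.

```python
cm = [[252, 1],
      [31, 10]]

def class_metrics(cm, k):
    """Precision, recall and F1 for class k from a square confusion matrix."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: predicted as k
    actual_k = sum(cm[k])                    # row sum: truly k
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for k in (0, 1):
    p, r, f = class_metrics(cm, k)
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Under these assumptions, class 1 (attrition) has a precision of about 0.91 but a recall of about 0.24: most predicted leavers are real ones, yet roughly three out of four actual leavers are missed.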

=>Confusion Matrix

#MODULE FOR CONFUSION MATRIX

import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import itertools

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        # divide each row by its sum so that every true class sums to 1
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    # write each cell's value, in white on dark cells and black on light ones
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

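# Generating the Confusion Matrix
from sklearn.metrics import confusion_matrix
plt.figure()

# raw counts, kept for reference (the plot below recomputes the matrix from the predictions)
cm = np.array([[252, 1], [31, 10]])

plot_confusion_matrix(confusion_matrix(y_test, predictions), classes=[0, 1],
                      normalize=True, title='Normalized Confusion Matrix')
# Output shown below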

Normalized Confusion Matrix
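
Row normalization divides each count by the total of its row, so every row shows how the examples of one true class were distributed over the predicted classes. A minimal pure-Python sketch, assuming the counts `[[252, 1], [31, 10]]` from the code above are the validation-set matrix with true classes as rows:

```python
cm = [[252, 1],
      [31, 10]]

# Divide every count by its row total (number of examples of that true class)
normalized = [[count / sum(row) for count in row] for row in cm]

for row in normalized:
    print([round(v, 3) for v in row])
# [0.996, 0.004]
# [0.756, 0.244]
```

The diagonal of the normalized matrix is the per-class recall, which is why it exposes the weak recall on the minority class more clearly than the raw counts do.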

=>Receiver Operating Characteristic Curve:

# Plotting the ROC Curve
y_roc = np.array(y_test)
# decision_function gives a continuous score per example, which roc_curve thresholds
fpr, tpr, thresholds = roc_curve(y_roc, clf_trained.decision_function(X_test))
roc_auc = auc(fpr, tpr)
pl.clf()
pl.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
pl.plot([0, 1], [0, 1], 'k--')  # diagonal = random guessing
pl.xlim([0.0, 1.0])
pl.ylim([0.0, 1.0])
pl.xlabel('False Positive Rate')
pl.ylabel('True Positive Rate')
pl.legend(loc="lower right")
pl.show()  # Output shown below

Receiver Operating Characteristic Curve (ROC Curve)
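
The area under the ROC curve has a direct interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way in pure Python on toy data; `roc_auc_pairwise` is a hypothetical helper and the labels and scores are illustrative, not taken from the model above.

```python
def roc_auc_pairwise(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and scores, for illustration only
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_pairwise(y_true, scores))  # 0.75
```

Here three of the four positive-negative pairs are ordered correctly, giving 0.75. Unlike accuracy, this measure does not depend on the 0.5 decision threshold.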

According to the Performance Analysis, the Machine Learning Predictive Model classified 89.12% of the unseen (Validation Set) examples correctly and showed quite decent figures on the other performance metrics.

Hence, in this way an Employee Attrition Predictive Model can be developed using Data Analysis and Machine Learning.

I have deployed this model in a Web Application using PHP (PHP: Hypertext Preprocessor) as the back-end, with the help of PHP-ML. The link to the Web-App is given below:

IBM-HR-ANALYTICS (navocommerce.in)

For questions about the article or the Web-App, or for discussions on Machine Learning or any other area of Data Science, feel free to reach out to me on LinkedIn.

Navoneel Chakrabarty - Founder - Road To Financial Data Science | LinkedIn (www.linkedin.com)