Predicting the Attrition of Valuable Employees

In an IT firm, there are many employee architectures available. Some IT firms, or particular departments or levels within them, follow the chief programmer structure, in which there is a "star" organisation around a "chief" position assigned to the engineer who best understands the system requirements.

Others follow an egoless (democratic) structure, where all the engineers are at the same level and are assigned different jobs such as front-end design, back-end coding and software testing. This architecture is not followed by very big or multinational software giants, but all in all it is a very successful and working-environment-friendly architecture.

[Figure: Egoless (Democratic) Architecture]

The third type of architecture is the mixed structure, which is a combination of the above two types. This is the most widely followed architecture and is very common among software giants.

[Figure: Mixed Controlled Architecture]

Likewise, International Business Machines Corporation (IBM) probably follows either the egoless or the mixed structure. So, for the HR department, an important responsibility is to measure the attrition of employees at specific time intervals.

The factors on which employee attrition depends are:

- Age of the Employee
- Monthly Income
- Overtime
- Monthly Rate
- Distance from Home
- Years at Company

and so on.

IBM also made their employee information publicly available, with the problem statement:

"Predict the attrition of the employees, i.e., will there be attrition of the employees or not, given the employee details, i.e., the factors responsible for attrition."

The employee dataset is available on Kaggle: IBM HR Analytics Employee Attrition & Performance (www.kaggle.com)

A possible solution to this problem is to apply machine learning: a predictive model is developed by training it on the available data and then validating it to analyse its performance.

Given below is a step-by-step procedure for developing the machine learning model using Python and the Scikit-Learn machine learning toolbox:

1. Model Development:

    # importing all the libraries
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    %matplotlib inline
    import pylab as pl
    from sklearn.metrics import roc_curve, auc

    # loading the dataset using Pandas
    df = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')
    df.head()
    # Output shown below

[Figure: Pandas DataFrame output of the dataset]

    # checking whether the dataset contains any missing values...
    df.shape == df.dropna().shape
    # Output shown below

Dropping the rows with missing values leaves the shape unchanged. Hence, there are no missing values present in the dataset.
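The shape comparison above only answers yes or no. If it ever came back False, a per-column count would show where the gaps are. A minimal sketch, assuming a small made-up DataFrame (the column names are borrowed from the dataset, but the values and the missing entries are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame for illustration; NOT rows from the IBM dataset
toy = pd.DataFrame({
    'Age': [41, 49, np.nan, 33],
    'MonthlyIncome': [5993, 5130, 2090, np.nan],
    'OverTime': ['Yes', 'No', 'Yes', 'No'],
})

# Same yes/no check as in the article; False here because two rows contain NaN
no_missing = toy.shape == toy.dropna().shape

# Number of missing values in each column
missing_per_column = toy.isnull().sum()

print(no_missing)
print(missing_per_column)
```

On the real dataset the same `isnull().sum()` call would be expected to report zero for every column, in line with the check above.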
This is a binary classification problem, so the distribution of instances between the 2 classes is visualized below:

    y_bar = np.array([df[df['Attrition'] == 'No'].shape[0],
                      df[df['Attrition'] == 'Yes'].shape[0]])
    x_bar = ['No (0)', 'Yes (1)']

    # Bar Visualization
    plt.bar(x_bar, y_bar)
    plt.xlabel('Labels/Classes')
    plt.ylabel('Number of Instances')
    plt.title('Distribution of Labels/Classes in the Dataset')
    # Output shown below

[Figure: Bar Visualization of the Class Distribution]

    # Label Encoding for Categorical/Non-Numeric Data
    # X holds every column except 'Attrition' (column 1), which becomes y
    X = df.iloc[:, [0] + list(range(2, 35))].values
    y = df.iloc[:, 1].values
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder

    labelencoder_X_1 = LabelEncoder()
    X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
    X[:, 3] = labelencoder_X_1.fit_transform(X[:, 3])
    X[:, 6] = labelencoder_X_1.fit_transform(X[:, 6])
    X[:, 10] = labelencoder_X_1.fit_transform(X[:, 10])
    X[:, 14] = labelencoder_X_1.fit_transform(X[:, 14])
    X[:, 16] = labelencoder_X_1.fit_transform(X[:, 16])
    X[:, 20] = labelencoder_X_1.fit_transform(X[:, 20])
    X[:, 21] = labelencoder_X_1.fit_transform(X[:, 21])
    y = labelencoder_X_1.fit_transform(y)

    # Feature Selection using Random Forest Classifier's Feature Importance Scores
    from sklearn.ensemble import RandomForestClassifier
    model = RandomForestClassifier()
    model.fit(X, y)
    # Output shown below

    # Rank the feature indices from most to least important
    list_importances = list(model.feature_importances_)
    indices = sorted(range(len(list_importances)), key=lambda k: list_importances[k])
    feature_selected = [None] * 34
    k = 0
    for i in reversed(indices):
        if k <= 33:
            feature_selected[k] = i
            k = k + 1
    X_selected = X[:, feature_selected[:17]]

    # X skips the 'Attrition' column, so the names are looked up
    # in the same column subset that was used to build X
    feature_names = df.columns[[0] + list(range(2, 35))]
    l_features = np.array([feature_names[x] for x in feature_selected])

    # Extracting 17 most important features among 34 features
    l_features[:17]
    # Output shown below

    # Selecting the 17 most important features
    selected_columns = ['Age', 'MonthlyIncome', 'OverTime', 'EmployeeNumber',
                        'MonthlyRate', 'DistanceFromHome', 'YearsAtCompany',
                        'TotalWorkingYears', 'DailyRate', 'HourlyRate',
                        'NumCompaniesWorked', 'JobInvolvement', 'PercentSalaryHike',
                        'StockOptionLevel', 'YearsWithCurrManager',
                        'EnvironmentSatisfaction', 'EducationField']
    df_features = pd.DataFrame(X_selected, columns=selected_columns)
    df_selected = df[selected_columns + ['Attrition']]
    df_selected.head()
    # Output shown below

Since df_selected is taken from the original DataFrame, label encoding has to be done again for the selected categorical features:

    # Label Encoding for selected Non-Numeric Features:
    X = df_selected.iloc[:, list(range(0, 17))].values
    y = df_selected.iloc[:, 17].values

    X[:, 2] = labelencoder_X_1.fit_transform(X[:, 2])    # OverTime
    X[:, 16] = labelencoder_X_1.fit_transform(X[:, 16])  # EducationField
    y = labelencoder_X_1.fit_transform(y)

Now the data pre-processing steps are over. Let's move on to model training:

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1753)
    # 80-20 split: 80% of the data is for training the model
    # and 20% is for validation and performance analysis

    # Using Logistic Regression Algorithm for Model Training
    from sklearn.linear_model import LogisticRegression
    clf = LogisticRegression(verbose=3)

    # Training the Model
    clf_trained = clf.fit(X_train, y_train)
    # Output shown below

[Figure: The library implementing the parameter optimization strategy used by Logistic Regression]

2. Model Performance Analysis:

=> Training Accuracy

    clf_trained.score(X_train, y_train)
    # Output shown below

A training accuracy of 84.44% is achieved by the model.

=> Validation Accuracy

    clf_trained.score(X_test, y_test)
    # Output shown below

A validation accuracy of 89.12% is achieved by the model.

=> Precision, Recall and F1-Score

    # getting the predictions...
    predictions = clf_trained.predict(X_test)
    from sklearn.metrics import classification_report
    print(classification_report(y_test, predictions))

[Figure: Classification Report of the model]

=> Confusion Matrix

    # MODULE FOR CONFUSION MATRIX
    import matplotlib.pyplot as plt
    %matplotlib inline
    import numpy as np
    import itertools
    from sklearn.metrics import confusion_matrix

    def plot_confusion_matrix(cm, classes,
                              normalize=False,
                              title='Confusion matrix',
                              cmap=plt.cm.Blues):
        """
        This function prints and plots the confusion matrix.
        Normalization can be applied by setting `normalize=True`.
        """
        if normalize:
            # divide each row by its total, so every row sums to 1
            cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
            print("Normalized confusion matrix")
        else:
            print('Confusion matrix, without normalization')

        print(cm)

        plt.imshow(cm, interpolation='nearest', cmap=cmap)
        plt.title(title)
        plt.colorbar()
        tick_marks = np.arange(len(classes))
        plt.xticks(tick_marks, classes, rotation=45)
        plt.yticks(tick_marks, classes)

        # write each cell value on the plot, in a colour that contrasts with the cell
        fmt = '.2f' if normalize else 'd'
        thresh = cm.max() / 2.
        for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
            plt.text(j, i, format(cm[i, j], fmt),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")

        plt.tight_layout()
        plt.ylabel('True label')
        plt.xlabel('Predicted label')

    # Generating the Confusion Matrix
    plt.figure()
    # raw (unnormalized) counts; the call below recomputes the matrix from the predictions
    cm = np.array([[252, 1], [31, 10]])
    plot_confusion_matrix(confusion_matrix(y_test, predictions), classes=[0, 1],
                          normalize=True, title='Normalized Confusion Matrix')
    # Output shown below

[Figure: Normalized Confusion Matrix]

=> Receiver Operating Characteristic Curve:

    # Plotting the ROC Curve
    y_roc = np.array(y_test)
    fpr, tpr, thresholds = roc_curve(y_roc, clf_trained.decision_function(X_test))
    roc_auc = auc(fpr, tpr)
    pl.clf()
    pl.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
    pl.plot([0, 1], [0, 1], 'k--')
    pl.xlim([0.0, 1.0])
    pl.ylim([0.0, 1.0])
    pl.xlabel('False Positive Rate')
    pl.ylabel('True Positive Rate')
    pl.legend(loc="lower right")
    pl.show()
    # Output shown below

[Figure: Receiver Operating Characteristic Curve (ROC Curve)]

According to the performance analysis, it can be concluded that the machine learning predictive model has been successful in correctly classifying 89.12% of the unseen (validation set) examples and has shown quite decent figures for the different performance metrics.

Hence, in this way an employee attrition predictive model can be developed using data analysis and machine learning.

I have deployed this model in a web application using PHP (PHP: Hypertext Preprocessor) as the back end, with the help of PHP-ML. The link to the web app is given below:

IBM-HR-ANALYTICS (navocommerce.in)

For personal contact regarding the article or the web app, or for discussions on machine learning or any area of data science, feel free to reach out to me on LinkedIn:

Navoneel Chakrabarty - Founder - Road To Financial Data Science | LinkedIn (www.linkedin.com)
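As a closing check on the performance figures above, the raw confusion-matrix counts listed in the article, [[252, 1], [31, 10]], can be turned into accuracy, precision and recall by hand. This is only a sketch, assuming the usual scikit-learn layout (rows are true labels, columns are predicted labels, class 1 means attrition); the variable names are chosen here for illustration:

```python
# Raw counts from the article's confusion matrix.
# Assumed layout: rows = true label, columns = predicted label
tn, fp = 252, 1   # true class 0 (no attrition)
fn, tp = 31, 10   # true class 1 (attrition)

total = tn + fp + fn + tp           # number of validation examples

accuracy = (tp + tn) / total        # share of all examples classified correctly
precision = tp / (tp + fp)          # of the predicted leavers, how many actually left
recall = tp / (tp + fn)             # of the actual leavers, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
print(f"f1        = {f1:.4f}")
```

The accuracy works out to 262/294, about 89.12%, which agrees with the validation accuracy reported above. Under the same layout assumption, recall for the attrition class is 10/41, so the overall accuracy is carried largely by the majority "No" class.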