Predicting the Attrition of Valuable Employees
----------------------------------------------

In an IT firm, several team architectures are in use. Some firms, or particular departments or levels within them, follow the _chief programmer_ structure: a "star" organisation built around a "chief" position held by the engineer who best understands the system requirements. Others follow an _egoless_ (_democratic_) structure, in which all engineers sit at the same level but are assigned different jobs such as front-end design, back-end coding, and software testing. Very large multinational software giants rarely follow this architecture in its pure form, but all in all it is a successful and pleasant working environment.

**Egoless (Democratic) Architecture**

The third type is the _mixed_ structure, a combination of the two above. It is the most widely followed architecture and very common among software giants.

**Mixed Controlled Architecture**

International Business Machines Corporation (IBM) likewise most likely follows either the egoless or the mixed structure. So, for the HR department, an important responsibility is to measure employee attrition at regular intervals. The factors on which employee attrition depends include:

1. _Age of the Employee_
2. _Monthly Income_
3. _Overtime_
4. _Monthly Rate_
5. _Distance from Home_
6. _Years at Company_

and so on…

IBM has also made its employee data publicly available, with the problem statement: "_Predict the attrition of the employees, i.e., whether there will be attrition of an employee or not, given the employee details (the factors responsible for attrition)._"

The employee dataset is available on Kaggle: [**IBM HR Analytics Employee Attrition & Performance** — _Predict attrition of your valuable employees_](https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset)

A possible solution is to apply machine learning: develop a predictive model, train it on the available data, and validate it for performance analysis. Given below is a step-by-step procedure for developing the model in Python with the Scikit-Learn machine-learning toolbox:

1. **Model Development:**

```python
# importing all the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import pylab as pl
from sklearn.metrics import roc_curve, auc
```

```python
# loading the dataset using Pandas
df = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')
df.head()  # Output shown below
```

**Pandas DataFrame Output of the Dataset**

```python
# checking whether the dataset contains any missing values:
# True means dropping incomplete rows changes nothing
df.shape == df.dropna().shape  # Output shown below
```

Hence, there are no missing values present in the dataset.
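The shape comparison works because `dropna()` discards any row containing a missing value. As a small aside (not part of the original notebook), the same check can be made per column, which would also show *where* any missing values sit:

```python
# Count missing values in every column; all zeros confirms
# that the dataset is complete.
df.isnull().sum()
```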
This is a Binary Classification Problem, so the distribution of instances among the 2 classes is visualized below:

```python
y_bar = np.array([df[df['Attrition'] == 'No'].shape[0],
                  df[df['Attrition'] == 'Yes'].shape[0]])
x_bar = ['No (0)', 'Yes (1)']

# Bar Visualization
plt.bar(x_bar, y_bar)
plt.xlabel('Labels/Classes')
plt.ylabel('Number of Instances')
plt.title('Distribution of Labels/Classes in the Dataset')
# Output shown below
```

**Bar Visualization of the Class Distribution**

```python
# Label Encoding for Categorical/Non-Numeric Data
# (column 1, 'Attrition', is the target y; all other columns form X)
X = df.iloc[:, [0] + list(range(2, 35))].values
y = df.iloc[:, 1].values

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
X[:, 3] = labelencoder_X_1.fit_transform(X[:, 3])
X[:, 6] = labelencoder_X_1.fit_transform(X[:, 6])
X[:, 10] = labelencoder_X_1.fit_transform(X[:, 10])
X[:, 14] = labelencoder_X_1.fit_transform(X[:, 14])
X[:, 16] = labelencoder_X_1.fit_transform(X[:, 16])
X[:, 20] = labelencoder_X_1.fit_transform(X[:, 20])
X[:, 21] = labelencoder_X_1.fit_transform(X[:, 21])
y = labelencoder_X_1.fit_transform(y)
```

```python
# Feature Selection using Random Forest Classifier's Feature
# Importance Scores
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X, y)  # Output shown below
```

```python
# rank all 34 features from most to least important
list_importances = list(model.feature_importances_)
indices = sorted(range(len(list_importances)),
                 key=lambda k: list_importances[k])
feature_selected = [None] * 34
k = 0
for i in reversed(indices):
    if k <= 33:
        feature_selected[k] = i
        k = k + 1

# keep the 17 most important features
X_selected = X[:, feature_selected[:17]]

# map the ranked indices back to column names; the lookup uses the
# same column subset that built X (df column 1, 'Attrition', was
# excluded, so indexing df.columns directly would be off by one)
feature_cols = df.columns[[0] + list(range(2, 35))]
l_features = np.array([feature_cols[x] for x in feature_selected])

# Extracting the 17 most important features among the 34 features
l_features[:17]  # Output shown below
```

```python
# Selecting the 17 most important features and appending the target
df_selected = pd.DataFrame(np.column_stack((X_selected, y)),
                           columns=['Age', 'MonthlyIncome', 'OverTime',
                                    'EmployeeNumber', 'MonthlyRate',
                                    'DistanceFromHome', 'YearsAtCompany',
                                    'TotalWorkingYears', 'DailyRate',
                                    'HourlyRate', 'NumCompaniesWorked',
                                    'JobInvolvement', 'PercentSalaryHike',
                                    'StockOptionLevel',
                                    'YearsWithCurrManager',
                                    'EnvironmentSatisfaction',
                                    'EducationField', 'Attrition'])
df_selected.head()  # Output shown below
```

So, label encoding has to be done again for the selected categorical features:

```python
# Label Encoding for selected Non-Numeric Features
X = df_selected.iloc[:, list(range(0, 17))].values
y = df_selected.iloc[:, 17].values
X[:, 2] = labelencoder_X_1.fit_transform(X[:, 2])
X[:, 16] = labelencoder_X_1.fit_transform(X[:, 16])
y = labelencoder_X_1.fit_transform(y)
```

Now the data pre-processing steps are over. Let's move on to model training:

```python
# 80-20 splitting where 80% Data is for Training the Model
# and 20% Data is for Validation and Performance Analysis
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=1753)
```

```python
# Using Logistic Regression Algorithm for Model Training
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(verbose=3)

# Training the Model
clf_trained = clf.fit(X_train, y_train)  # Output shown below
```

**The parameter-optimization library used by Logistic Regression**
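Before moving on to performance analysis, two optional refinements are worth noting. First, `LabelEncoder` maps each category to an integer and therefore imposes an artificial ordering on nominal features such as `EducationField`; the imports above already include `OneHotEncoder`, which avoids this. A minimal sketch using `pandas.get_dummies` on the raw dataframe (an illustration of the idea, not the pipeline actually used here; `df_encoded` is a hypothetical name):

```python
# One-hot encode two nominal features: each category becomes its own
# binary (0/1) column, so no artificial ordering is introduced.
df_encoded = pd.get_dummies(df, columns=['OverTime', 'EducationField'])
df_encoded.head()
```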
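Second, features such as `MonthlyIncome` and `Age` differ in scale by orders of magnitude, and logistic regression solvers generally converge better on standardized inputs. A sketch of adding `StandardScaler` (again an optional refinement; the results reported below come from the unscaled pipeline):

```python
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training split only, then reuse it on the
# validation split to avoid leaking validation statistics.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

clf_scaled = LogisticRegression(verbose=3).fit(X_train_scaled, y_train)
```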
2. **Model Performance Analysis:**

=> **Training Accuracy**

```python
clf_trained.score(X_train, y_train)  # Output shown below
```

**A training accuracy of 84.44% is achieved by the model**

=> **Validation Accuracy**

```python
clf_trained.score(X_test, y_test)  # Output shown below
```

**A validation accuracy of 89.12% is achieved by the model**

=> **Precision, Recall and F1-Score**

```python
from sklearn.metrics import classification_report

# getting the predictions...
predictions = clf_trained.predict(X_test)
print(classification_report(y_test, predictions))
```

**Classification Report of the model**

=> **Confusion Matrix**

```python
# MODULE FOR CONFUSION MATRIX
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import itertools
from sklearn.metrics import confusion_matrix


def plot_confusion_matrix(cm, classes, normalize=False,
                          title='Confusion matrix', cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')


# Generating the Confusion Matrix
# (raw counts on the validation set: [[252, 1], [31, 10]])
plt.figure()
plot_confusion_matrix(confusion_matrix(y_test, predictions),
                      classes=[0, 1], normalize=True,
                      title='Normalized Confusion Matrix')
# Output shown below
```

**Normalized Confusion Matrix**

=> **Receiver Operating Characteristic Curve:**

```python
# Plotting the ROC Curve
y_roc = np.array(y_test)
fpr, tpr, thresholds = roc_curve(y_roc,
                                 clf_trained.decision_function(X_test))
roc_auc = auc(fpr, tpr)
pl.clf()
pl.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
pl.plot([0, 1], [0, 1], 'k--')
pl.xlim([0.0, 1.0])
pl.ylim([0.0, 1.0])
pl.xlabel('False Positive Rate')
pl.ylabel('True Positive Rate')
pl.legend(loc="lower right")
pl.show()  # Output shown below
```

**Receiver Operating Characteristic Curve (ROC Curve)**

According to the performance analysis, it can be concluded that the machine-learning predictive model classifies 89.12% of unknown (validation-set) examples correctly and shows quite decent figures for the other performance metrics. Hence, in this way an employee attrition predictive model can be developed using data analysis and machine learning. I have deployed this model in a web application using PHP (**PHP: Hypertext Preprocessor**) as the back end, with the help of [**PHP-ML**](https://php-ml.readthedocs.io/en/latest/).
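The deployment above uses PHP-ML on the server side. If the back end were Python instead, the trained classifier could simply be persisted with `joblib` and reloaded inside the web service; a minimal sketch (the file name `attrition_model.joblib` is illustrative):

```python
import joblib

# Serialize the trained classifier to disk once, after training...
joblib.dump(clf_trained, 'attrition_model.joblib')

# ...then, inside the serving process, reload it and predict on
# feature vectors prepared exactly like the training data.
model = joblib.load('attrition_model.joblib')
model.predict(X_test[:5])
```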
The link to the web app is given below:

[**IBM-HR-ANALYTICS** — navocommerce.in](http://navocommerce.in/ibm/)

For personal contact regarding the article or the web app, or for discussions on machine learning or any department of data science, feel free to reach out to me on [**LinkedIn**](https://www.linkedin.com/in/navoneel-chakrabarty-314262129/).