Logistic Regression, KNN, Decision Tree, Random Forest, SVC, Linear SVC, GaussianNB, BernouliNB Python codes are available: https://github.com/JNYH/Project-McNulty Customer turnover, or churn rate, is the percentage of an organization’s customer base lost during a given period of time, usually a month or annual basis. A bad churn rate can be very damaging to revenue and profitability. The goal of this project is to predict customer churn in a Telecommunication company. We will explore 8 predictive analytic models to assess customers’ propensity or risk to churn. These models can generate a list of customers who are most vulnerable to churn, so that business can work towards retaining them. pandas pd numpy np matplotlib.pyplot plt %matplotlib inline %config InlineBackend.figure_formats = [‘retina’] seaborn sns time warnings warnings.filterwarnings(“ignore”) sklearn.model_selection KFold, cross_val_score sklearn.linear_model LogisticRegression sklearn.neighbors KNeighborsClassifier sklearn.naive_bayes GaussianNB, BernoulliNB, MultinomialNB sklearn.tree DecisionTreeClassifier sklearn.ensemble RandomForestClassifier sklearn.svm LinearSVC, SVC sklearn metrics sklearn.metrics confusion_matrix, classification_report sklearn.metrics accuracy_score, precision_score, recall_score, f1_score, log_loss, fbeta_score sklearn.metrics auc, roc_curve, roc_auc_score, precision_recall_curve import as import as import as import as import import from import from import from import from import from import from import from import from import from import from import from import This project is based on from . IBM sample datasets Kaggle df = pd.read_csv(‘WA_Fn-UseC_-Telco-Customer-Churn.csv’) df[ ] = df[ ].replace( ,np.nan) df = df.dropna(how = ) df[ ] = df[ ].astype(float) ( , df.shape[ ]) ( , df.shape[ ]) ( , df.columns.tolist()) ( , df.isnull().sum().values.sum()) ( , df.nunique()) df.info() df.isnull().sum() # remove 11 rows with spaces in TotalCharges (0.15% missing data) 'TotalCharges' 'TotalCharges' ' ' 'any' 'TotalCharges' 'TotalCharges' # data overview print 'Rows : ' 0 print 'Columns : ' 1 print '\nFeatures : \n' print '\nMissing values : ' print '\nUnique values : \n' In this dataset of over 7000 customers, 26% of them has left in the last month. This is critical to the Telco business because it is often more expensive to acquire new customers than to keep existing ones. print(df.Churn.value_counts()) df[‘Churn’].value_counts().plot(‘bar’).set_title(‘Churn’) The features in this dataset include the following: · : Gender, SeniorCitizen, Partner, Dependents · : PhoneService, MultipleLine, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies · : CustomerID, Contract, PaperlessBilling, PaymentMethod, MonthlyCharges, TotalCharges, Tenure demographic data subscribed services customer account information Target is Churn, which has binary classes 1 and 0. df[‘SeniorCitizen’] = df[‘SeniorCitizen’].replace({ :’Yes’, :’No’}) num_cols = [‘tenure’, ‘MonthlyCharges’, ‘TotalCharges’] df[num_cols].describe() # replace values for SeniorCitizen as a categorical feature 1 0 How do we select features? Features are selected mainly based on coefficient and using . The p-values and coefficients from have also been considered. Lasso Random Forest Statsmodels df1 = pd.read_csv(‘df1.csv’) X, y = df1.drop(‘Churn’,axis= ), df1[[‘Churn’]] sklearn.linear_model LassoCV, RidgeCV, ElasticNetCV sklearn.preprocessing StandardScaler, PolynomialFeatures print(‘Use LassoCV to find the optimal ALPHA value L1 regularization’) std = StandardScaler() std.fit(X.values) X_scaled = std.transform(X.values) print(‘X_scaled’, X_scaled.shape) alphavec = **np.linspace( , , ) lasso_model = LassoCV(alphas = alphavec, cv= ) lasso_model.fit(X_scaled, y) print(‘LASSO best alpha: ‘, lasso_model.alpha_ ) list(zip(X.columns, lasso_model.coef_)) 1 ## to find significant features using LassoCV (all X_scaled) from import from import for 10 -3 3 200 5 # display all coefficients in the model with optimal alpha rfc = RandomForestClassifier(random_state= , n_estimators= ) model = rfc.fit(X, y) (pd.Series(model.feature_importances_, index=X.columns) .nlargest( ) .plot(kind= , figsize=[ , ]) .invert_yaxis()) plt.yticks(size= ) plt.title( , size= ) ## To look for top features using Random Forest 0 100 47 'barh' 20 15 15 'Top Features derived by Random Forest' 20 Features are dropped when they do not contribute significantly to the model. Here are some examples (each chart shows the distribution of these over 7000 customers). For example, gender (whether male or female) and phone related services, customers are equally likely to churn, because the ratio of churn and non-churn customers are the same. categorical_features = [ ‘gender’, ‘SeniorCitizen’, ‘Partner’, ‘Dependents’, ‘PhoneService’, ‘MultipleLines’, ‘InternetService’, ‘OnlineSecurity’, ‘OnlineBackup’, ‘DeviceProtection’, ‘TechSupport’, ‘StreamingTV’, ‘StreamingMovies’, ‘PaymentMethod’, ‘PaperlessBilling’, ‘Contract’ ] ROWS, COLS = , fig, ax = plt.subplots(ROWS, COLS, figsize=( , ) ) row, col = , i, categorical_feature enumerate(categorical_features): col == COLS — : row += col = i % COLS df[df.Churn==’No’][categorical_feature].value_counts().plot(‘bar’, width= , ax=ax[row, col], color=’blue’, alpha= ).set_title(categorical_feature) df[df.Churn==’Yes’][categorical_feature].value_counts().plot(‘bar’, width= , ax=ax[row, col], color=’orange’, alpha= ).set_title(categorical_feature) plt.legend([‘No Churn’, ‘Churn’]) fig.subplots_adjust(hspace= ) # To analyse categorical feature distribution 4 4 18 20 0 0 for in if 1 1 .5 0.5 .3 0.7 0.7 What are good features to keep? A good feature is when we can distinguish between churn and non-churn customers, especially when the ratio is different. For example, those with month-to-month contract are more likely to leave, which is logical as they are not bounded by yearly contracts. Feature engineering These 3 features Tenure, Monthly Charges and Total Charges are continuous data to be split into categories. When I looked at the pair plot, it seems possible to separate the churn (orange) and non-churn (blue) customers. sns.pairplot(df[[ , , , ]], hue= , plot_kws=dict(alpha= , edgecolor= ), height= , aspect= ) 'tenure' 'MonthlyCharges' 'TotalCharges' 'Churn' 'Churn' .3 'none' 2 1.1 So, I have carefully selected these separation boundaries (below) to attempt to separate the churn and non-churn cases. fig, ax = plt.subplots( , , figsize=( , )) df[df.Churn == “No”][num_cols].hist(bins= , color=”blue”, alpha= , ax=ax) df[df.Churn == “Yes”][num_cols].hist(bins= , color=”orange”, alpha= , ax=ax) plt.legend([‘No Churn’, ‘Churn’], shadow= , loc= ) 1 3 15 3 35 0.5 35 0.7 True 9 df[‘MonthlyCharges’] <= : ‘ – ’ (df[‘MonthlyCharges’] > ) & (df[‘MonthlyCharges’] <= ): ‘ – ’ (df[‘MonthlyCharges’] > ) & (df[‘MonthlyCharges’] <= ): ‘ – ’ df[‘MonthlyCharges’] > : ‘ plus’ df[‘monthlycharges_group’] = df.apply( df:monthlycharges_split(df), axis = ) df[‘TotalCharges’] <= : ‘ – k’ (df[‘TotalCharges’] > ) & (df[‘TotalCharges’] <= ): ‘ k k’ (df[‘TotalCharges’] > ) & (df[‘TotalCharges’] <= ) : ‘ k k’ df[‘TotalCharges’] > : ‘ kplus’ df[‘totalcharges_group’] = df.apply( df:totalcharges_split(df), axis = ) df[‘tenure’] <= : ‘ – ’ (df[‘tenure’] > ) & (df[‘tenure’] <= ): ‘ – ’ (df[‘tenure’] > ) & (df[‘tenure’] <= ) : ‘ – ’ df[‘tenure’] > : ‘ plus’ df[‘tenure_group’] = df.apply( df:tenure_split(df), axis = ) # change MonthlyCharges to categorical column : def monthlycharges_split (df) if 30 return 0 30 elif 30 70 return 30 70 elif 70 99 return 70 99 elif 99 return 99 lambda 1 # change TotalCharges to categorical column : def totalcharges_split (df) if 2000 return 0 2 elif 2000 4000 return 2 -4 elif 4000 6000 return 4 -6 elif 6000 return 6 lambda 1 # change Tenure to categorical column : def tenure_split (df) if 20 return 0 20 elif 20 40 return 20 40 elif 40 60 return 40 60 elif 60 return 60 lambda 1 Model selection I have tested 8 classification models, trained them, optimised them, and here are the results. Logistic Regression generally works well on binary classification and has performed better than the rest. is a balance between Precision and Recall rates trade-off. F1-Score sklearn.model_selection train_test_split X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= , random_state= ) print(‘X_train’, X_train.shape) print(‘y_train’, y_train.shape) print(‘X_test’, X_test.shape) print(‘y_test’, y_test.shape) print(‘\nSearch OPTIMAL THRESHOLD, vary to , fit/predict on train/test data’) model.fit(X_train, y_train) optimal_th = i range( , ): score_list = [] print(‘\nLooping decimal place’, i+ ) th_list = [np.linspace(optimal_th , optimal_th+ , ), np.linspace(optimal_th , optimal_th+ , ), np.linspace(optimal_th , optimal_th+ , )] th th_list[i]: y_pred = (model.predict_proba(X_test)[:, ] >= th) f1scor = f1_score(y_test, y_pred) score_list.append(f1scor) print(‘{: f}->{: f}’.format(th, f1scor), end=’, ‘) optimal_th = float(th_list[i][score_list.index(max(score_list))]) print(‘optimal F1 score = {: f}’.format(max(score_list))) print(‘optimal threshold = {: f}’.format(optimal_th)) print(model_name, ‘accuracy score ’) print(‘Training: {: f}%’.format( *model.score(X_train, y_train))) print(‘Test set: {: f}%’.format( *model.score(X_test, y_test))) y_pred = (model.predict_proba(X_test)[:, ] >= ) print(‘\nAdjust threshold to :’) print(‘Precision: {: f}, Recall: {: f}, F1 Score: {: f}’.format( precision_score(y_test, y_pred), recall_score(y_test, y_pred), f1_score(y_test, y_pred))) print(model_name, ‘confusion matrix: \n’, confusion_matrix(y_test, y_pred)) y_pred = model.predict(X_test) print(‘\nDefault threshold of :’) print(‘Precision: {: f}, Recall: {: f}, F1 Score: {: f}’.format( precision_score(y_test, y_pred), recall_score(y_test, y_pred), f1_score(y_test, y_pred))) print(model_name, ‘confusion matrix: \n’, confusion_matrix(y_test, y_pred)) y_pred = (model.predict_proba(X_test)[:, ] >= ) print(‘\nAdjust threshold to :’) print(‘Precision: {: f}, Recall: {: f}, F1 Score: {: f}’.format( precision_score(y_test, y_pred), recall_score(y_test, y_pred), f1_score(y_test, y_pred))) print(model_name, ‘confusion matrix: \n’, confusion_matrix(y_test, y_pred)) y_pred = (model.predict_proba(X_test)[:, ] >= optimal_th) print(‘\nOptimal threshold {: f}’.format(optimal_th)) print(‘Precision: {: f}, Recall: {: f}, F1 Score: {: f}’.format( precision_score(y_test, y_pred), recall_score(y_test, y_pred), f1_score(y_test, y_pred))) print(model_name, ‘confusion matrix: \n’, confusion_matrix(y_test, y_pred)) model_f1, model_auc, model_ll, model_roc_auc model_f1 = f1_score(y_test, y_pred) y_pred = model.predict_proba(X_test) model_ll = log_loss(y_test, y_pred) print(model_name, ‘Log-loss: {: f}’.format(model_ll)) y_pred = model.predict(X_test) model_roc_auc = roc_auc_score(y_test, y_pred) print(model_name, ‘roc_auc_score: {: f}’.format(model_roc_auc)) y_pred = model.predict_proba(X_test)[:, ] fpr, tpr, thresholds = roc_curve(y_test, y_pred) model_auc = auc(fpr, tpr) print(model_name, ‘AUC: {: f}’.format(model_auc)) plt.figure(figsize = [ , ]) plt.plot(fpr, tpr, label=’ROC curve (area = % f)’ % model_auc) plt.plot([ , ], [ , ],’r — ‘) plt.xlim([ , ]) plt.ylim([ , ]) plt.xlabel(‘ Positive Rate’) plt.ylabel(‘ Positive Rate’) plt.title(‘Receiver Operating Characteristic’) plt.legend(loc=”lower right”) plt.show() model_list = [] f1_list = [] auc_list = [] ll_list = [] roc_auc_list = [] time_list = [] # split data to 80:20 ratio for train/test from import .2 71 : def model_report (model_name, model) for from 0.0001 0.9999 0.5 # start with default threshold value for in 0 3 1 -0.4999 0.4999 11 -0.1 0.1 21 -0.01 0.01 21 for in 1 .3 .4 .4 .3 is .2 100 # score uses accuracy .2 100 1 0.25 0.25 .4 .4 .4 0.50 .4 .4 .4 1 0.75 0.75 .4 .4 .4 1 .3 .4 .4 .4 global .4 .4 1 .4 # plot the ROC curve 6 6 0.2 0 1 0 1 0.0 1.0 0.0 1.0 False True # plt.savefig(‘roc_auc_score’) return # initialise lists to collect the results to plot later print(‘\n”””””” LogisticRegression “”””””’) print(‘\nSearch optimal hyperparameter C LogisticRegresssion, vary C to , using KFold( ) Cross Validation on train data’) kf = KFold(n_splits= , random_state= , shuffle= ) score_list = [] c_list = **np.linspace( , , ) c c_list: logit = LogisticRegression(C = c) cvs = (cross_val_score(logit, X_train, y_train, cv=kf, scoring=’f1 for in from 0.001 1000 5 5 21 True 10 -3 3 200 for in ')).mean() score_list.append(cvs) print(‘{:.4f}’.format(cvs), end=”, “) # 4 decimal pl print(‘optimal cv F1 score = {:.4f}’.format(max(score_list))) optimal_c = float(c_list[score_list.index(max(score_list))]) print(‘optimal value of C = {:.3f}’.format(optimal_c)) time1 = time.time() logit = LogisticRegression(C = optimal_c) model_report(‘LogisticRegression’, logit) model_list.append(‘LogisticRegression’) f1_list.append(model_f1) auc_list.append(model_auc) ll_list.append(model_ll) roc_auc_list.append(model_roc_auc) time_list.append(time.time() — time1) fig, ax = plt.subplots( , , figsize=( , )) ax[ ].bar(model_list, f1_list) ax[ ].set_title( ) ax[ ].bar(model_list, auc_list) ax[ ].set_title( ); ax[ ].bar(model_list, ll_list) ax[ ].set_title( ) ax[ ].bar(model_list, roc_auc_list) ax[ ].set_title( ) ax[ ].bar(model_list, time_list) ax[ ].set_title( ) fig.subplots_adjust(hspace= , wspace= ) ## plot the classification report scores 5 1 18 28 0 0 'F1-score' 1 1 'AUC-score' 2 2 'Log-Loss-Score' 3 3 'ROC AUC Score' 4 4 'Time taken' 0.2 0.2 stands for area under curve, which is explained below. The ROC ( ) curve is a relationship between True Positive Rate and False Positive Rate. Logistic Regression, which is the red line, has an area of 0.82 under this curve. It means the model has performed better than random guesses of 0.5, the diagonal line. AUC Receiver Operating Characteristic plt.figure(figsize=( , )) y_pred = gnb.predict_proba(X_test)[:, ] fpr, tpr, thresholds = roc_curve(y_test, y_pred) plt.plot(fpr, tpr, color=’blue’, lw= , label=’GaussianNB (area = % f)’ % auc_list[ ]) y_pred = bnb.predict_proba(X_test)[:, ] fpr, tpr, thresholds = roc_curve(y_test, y_pred) plt.plot(fpr, tpr, color=’green’, lw= , label=’BernoulliNB (area = % f)’ % auc_list[ ]) y_pred = logit.predict_proba(X_test)[:, ] fpr, tpr, thresholds = roc_curve(y_test, y_pred) plt.plot(fpr, tpr, color=’red’, lw= , label=’LogisticRegression (area = % f)’ % auc_list[ ]) y_pred = knn.predict_proba(X_test)[:, ] fpr, tpr, thresholds = roc_curve(y_test, y_pred) plt.plot(fpr, tpr, color=’yellow’, lw= , label=’KNN (area = % f)’ % auc_list[ ]) y_pred = decisiontree.predict_proba(X_test)[:, ] fpr, tpr, thresholds = roc_curve(y_test, y_pred) plt.plot(fpr, tpr, color=’purple’, lw= , label=’DecisionTree (area = % f)’ % auc_list[ ]) y_pred = randomforest.predict_proba(X_test)[:, ] fpr, tpr, thresholds = roc_curve(y_test, y_pred) plt.plot(fpr, tpr, color=’brown’, lw= , label=’RandomForest (area = % f)’ % auc_list[ ]) y_pred = linearsvc.predict(X_test) fpr, tpr, thresholds = roc_curve(y_test, y_pred) plt.plot(fpr, tpr, color=’cyan’, lw= , label=’LinearSVC (area = % f)’ % auc_list[ ]) y_pred = svc.predict_proba(X_test)[:, ] fpr, tpr, thresholds = roc_curve(y_test, y_pred) plt.plot(fpr, tpr, color=’magenta’, lw= , label=’SVC (area = % f)’ % auc_list[ ]) plt.plot([ , ], [ , ], color=’navy’, lw= , linestyle=’ — ‘) plt.xlim([ , ]) plt.ylim([ , ]) plt.xlabel(‘ Positive Rate’, fontsize= ) plt.ylabel(‘ Positive Rate’, fontsize= ) plt.title(‘Receiver Operating Characteristic’, fontsize= ) plt.legend(loc=’lower right’, fontsize= ) plt.show() # plot the ROC curves 10 10 1 3 0.2 0 1 3 0.2 1 1 2 0.2 2 1 3 0.2 3 1 2 0.2 4 1 2 0.2 5 2 0.2 6 1 2 0.2 7 0 1 0 1 2 0.0 1.0 0.0 1.0 False 13 True 14 17 13 For time taken, Logistic Regression may not be the fastest, but it is definitely not the slowest. SVC is 10 times slower than Random Forest Classifier, which is 10 times slower than Logistic Regression. Another performance metric is where we look for the model with the lowest loss function. Once again Logistic Regression is the best, and thus it is the final selected model. Log-Loss-Score Hyperparameter Tuning Next I tuned the of Logistic Regression to maximise the F1-Score. By tuning this hyperparameter, I have achieved the optimised Recall rate of 76%. is defined as the number of churned customers predicted correctly. However, the trade-off is that only 58% of the churn predictions ( rate) are correct. This is due to the limitation in the current model and dataset. threshold Recall Precision Evaluation on test data Finally, I evaluated the Logistic Regression model on test data. Features are sorted in descending order of importance from the list of 47 features. Depending on the number of features used in the model, the performance scores are different. Notice that there is a diminishing marginal improvement in the F1-Score. Conclusion I would propose to implement the Logistic Regression model with these top 18 features. And business can also focus on this list of features to understand whether a customer will likely to churn or not. codes for the above analysis are available on my , do feel free to refer to them. Python GitHub https://github.com/JNYH/Project-McNulty Video presentation: Thank you for reading!