Visualize Insights and Discover Driving Features in Lending Club Credit Risk Model for Loan Defaults

Lending Club is the largest online loan marketplace, facilitating personal loans, business loans, and financing of medical procedures. Borrowers can easily access lower-interest-rate loans through a fast online interface. Like most other lending companies, lending to 'risky' applicants is the largest source of financial loss (called credit loss). Credit loss is the amount of money lost by the lender when the borrower refuses to pay or runs away with the money owed; in other words, borrowers who default cause the largest losses to lenders. Therefore, using Data Science and Exploratory Data Analysis on public data from Lending Club, we will explore and crunch out the driving factors behind loan default, i.e. the variables that are strong indicators of default. The company can then use this knowledge for its portfolio and risk assessment.

About the Lending Club Loan Dataset

The dataset contains complete loan data for all loans issued through 2007–2011, including the current loan status (Current, Charged-off, Fully Paid) and latest payment information. Additional features include credit scores, number of finance inquiries, and collections, among others. The file is a matrix of about 39 thousand observations and 111 variables. A Data Dictionary is provided in a separate file in the dataset. The dataset can be downloaded here on Kaggle.

Questions

What set of loan data are we working with?
What types of features do we have?
Do we need to treat missing values?
What is the distribution of Loan Status?
What is the distribution of Loan Default with other features?
What plots can we draw for inferring the relation with Loan Default?
Majorly, what are the driving features that describe Loan Default?

Feature Distribution

Loan Characteristics, such as loan amount, term, and purpose, which show information about the loan and will help us find loan defaults.
Demographic Variables, such as age, employment status, and relationship status, which show information about the borrower profile and are not useful for us.
Behavioural Variables, such as next payment date, EMI, and delinquency, which show information updated after the loan is provided; in our case this is not useful, since we need to decide whether to approve the loan through default analysis.

Here is a quick overview of the things we are going to see in this article:

Dataset Overview (Distribution of Loans)
Data Cleaning (Missing Values, Standardize Data, Outlier Treatment)
Metrics Derivation (Binning)
Univariate Analysis (Categorical/Continuous Features)
Bivariate Analysis (Box Plots, Scatter Plots, Violin Plots)
Multivariate Analysis (Correlation Heatmap)

Data/Library Imports

```python
# import required libraries
import numpy as np
print('numpy version:', np.__version__)
import pandas as pd
print('pandas version:', pd.__version__)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set(style="whitegrid")
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (12, 8)
pd.options.mode.chained_assignment = None
pd.options.display.float_format = '{:.2f}'.format
pd.set_option('display.max_columns', 200)
pd.set_option('display.width', 400)

# file path variable
case_data = "/kaggle/input/lending-club-loan-dataset-2007-2011/loan.csv"
loan = pd.read_csv(case_data, low_memory=False)
```

The dataset has 111 columns and 39,717 rows.
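Roughly half of those 111 columns turn out to be completely empty, which is why the cleaning step below starts by dropping them. As a quick sanity check, missingness per column can be quantified before dropping anything. A minimal sketch on a toy frame (the column names here are illustrative stand-ins, not the real schema):

```python
import pandas as pd
import numpy as np

# toy frame standing in for the raw loan data: one fully empty column,
# one partially empty column, one complete column
df = pd.DataFrame({
    "loan_amnt": [5000, 2500, 10000, 7000],
    "desc": [None, "debt", None, "car"],
    "settlement_date": [np.nan] * 4,   # hypothetical fully-null column
})

# fraction of missing values per column
null_ratio = df.isnull().mean()
fully_empty = null_ratio[null_ratio == 1.0].index.tolist()
print(null_ratio)
print("completely empty columns:", fully_empty)
```

Columns with a null ratio of 1.0 are exactly what `dropna(axis=1, how="all")` removes later.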
Dataset Overview

```python
# plotting pie chart for different types of loan_status
chargedOffLoans = loan.loc[(loan["loan_status"] == "Charged Off")]
currentLoans = loan.loc[(loan["loan_status"] == "Current")]
fullyPaidLoans = loan.loc[(loan["loan_status"] == "Fully Paid")]

data = [{"Charged Off": chargedOffLoans["funded_amnt_inv"].sum(),
         "Fully Paid": fullyPaidLoans["funded_amnt_inv"].sum(),
         "Current": currentLoans["funded_amnt_inv"].sum()}]
investment_sum = pd.DataFrame(data)
chargedOffTotalSum = float(investment_sum["Charged Off"])
fullyPaidTotalSum = float(investment_sum["Fully Paid"])
currentTotalSum = float(investment_sum["Current"])
loan_status = [chargedOffTotalSum, fullyPaidTotalSum, currentTotalSum]
loan_status_labels = 'Charged Off', 'Fully Paid', 'Current'

plt.pie(loan_status, labels=loan_status_labels, autopct='%1.1f%%')
plt.title('Loan Status Aggregate Information')
plt.axis('equal')
plt.legend(loan_status_labels, title="Loan Amount", loc="center left", bbox_to_anchor=(1, 0, 0.5, 1))
plt.show()
```

```python
# plotting pie chart for different types of purpose
loans_purpose = loan.groupby(['purpose'])['funded_amnt_inv'].sum().reset_index()
plt.figure(figsize=(14, 10))
plt.pie(loans_purpose["funded_amnt_inv"], labels=loans_purpose["purpose"], autopct='%1.1f%%')
plt.title('Loan purpose Aggregate Information')
plt.axis('equal')
plt.legend(loans_purpose["purpose"], title="Loan purpose", loc="center left", bbox_to_anchor=(1, 0, 0.5, 1))
plt.show()
```

Data Cleaning

```python
# in the dataset, around half of the columns are completely null,
# hence remove all columns having no values
loan = loan.dropna(axis=1, how="all")
print("Looking into remaining columns info:")
print(loan.info(max_cols=200))
```

We are left with the following columns:

```
Looking into remaining columns info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39717 entries, 0 to 39716
Data columns (total 57 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   id                          39717 non-null  int64
 1   member_id                   39717 non-null  int64
 2   loan_amnt                   39717 non-null  int64
 3   funded_amnt                 39717 non-null  int64
 4   funded_amnt_inv             39717 non-null  float64
 5   term                        39717 non-null  object
 6   int_rate                    39717 non-null  object
 7   installment                 39717 non-null  float64
 8   grade                       39717 non-null  object
 9   sub_grade                   39717 non-null  object
 10  emp_title                   37258 non-null  object
 11  emp_length                  38642 non-null  object
 12  home_ownership              39717 non-null  object
 13  annual_inc                  39717 non-null  float64
 14  verification_status         39717 non-null  object
 15  issue_d                     39717 non-null  object
 16  loan_status                 39717 non-null  object
 17  pymnt_plan                  39717 non-null  object
 18  url                         39717 non-null  object
 19  desc                        26777 non-null  object
 20  purpose                     39717 non-null  object
 21  title                       39706 non-null  object
 22  zip_code                    39717 non-null  object
 23  addr_state                  39717 non-null  object
 24  dti                         39717 non-null  float64
 25  delinq_2yrs                 39717 non-null  int64
 26  earliest_cr_line            39717 non-null  object
 27  inq_last_6mths              39717 non-null  int64
 28  mths_since_last_delinq      14035 non-null  float64
 29  mths_since_last_record      2786 non-null   float64
 30  open_acc                    39717 non-null  int64
 31  pub_rec                     39717 non-null  int64
 32  revol_bal                   39717 non-null  int64
 33  revol_util                  39667 non-null  object
 34  total_acc                   39717 non-null  int64
 35  initial_list_status         39717 non-null  object
 36  out_prncp                   39717 non-null  float64
 37  out_prncp_inv               39717 non-null  float64
 38  total_pymnt                 39717 non-null  float64
 39  total_pymnt_inv             39717 non-null  float64
 40  total_rec_prncp             39717 non-null  float64
 41  total_rec_int               39717 non-null  float64
 42  total_rec_late_fee          39717 non-null  float64
 43  recoveries                  39717 non-null  float64
 44  collection_recovery_fee     39717 non-null  float64
 45  last_pymnt_d                39646 non-null  object
 46  last_pymnt_amnt             39717 non-null  float64
 47  next_pymnt_d                1140 non-null   object
 48  last_credit_pull_d          39715 non-null  object
 49  collections_12_mths_ex_med  39661 non-null  float64
 50  policy_code                 39717 non-null  int64
 51  application_type            39717 non-null  object
 52  acc_now_delinq              39717 non-null  int64
 53  chargeoff_within_12_mths    39661 non-null  float64
 54  delinq_amnt                 39717 non-null  int64
 55  pub_rec_bankruptcies        39020 non-null  float64
 56  tax_liens                   39678 non-null  float64
dtypes: float64(20), int64(13), object(24)
memory usage: 17.3+ MB
```

Now, we will remove all the features which are of no use for default analysis at credit approval time (demographic and customer-behavioural features):
```python
# remove non-required columns
# id - not required
# member_id - not required
# funded_amnt - not useful; funded_amnt_inv is what is actually funded to the person
# emp_title - brand names, not useful
# pymnt_plan - fixed value 'n' for all
# url - not useful
# desc - could be used with some NLP, but not for EDA
# title - too many distinct values, not useful
# zip_code - complete zip is not available
# delinq_2yrs - post-approval feature
# mths_since_last_delinq - only half the values present, not much information
# mths_since_last_record - only 10% of values present
# revol_bal - post-approval/behavioural feature
# initial_list_status - fixed value 'f' for all
# out_prncp - post-approval feature
# out_prncp_inv - not useful, it is for investors
# total_pymnt - post-approval feature
# total_pymnt_inv - not useful, it is for investors
# total_rec_prncp - post-approval feature
# total_rec_int - post-approval feature
# total_rec_late_fee - post-approval feature
# recoveries - post-approval feature
# collection_recovery_fee - post-approval feature
# last_pymnt_d - post-approval feature
# last_pymnt_amnt - post-approval feature
# next_pymnt_d - post-approval feature
# last_credit_pull_d - irrelevant for approval
# collections_12_mths_ex_med - only 1 value
# policy_code - only 1 value
# acc_now_delinq - single valued
# chargeoff_within_12_mths - post-approval feature
# delinq_amnt - single valued
# tax_liens - single valued
# application_type - single valued
# pub_rec_bankruptcies - single valued for more than 99%
# addr_state - may not depend on location as this is the financial domain
colsToDrop = ["id", "member_id", "funded_amnt", "emp_title", "pymnt_plan", "url", "desc",
              "title", "zip_code", "delinq_2yrs", "mths_since_last_delinq",
              "mths_since_last_record", "revol_bal", "initial_list_status", "out_prncp",
              "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp",
              "total_rec_int", "total_rec_late_fee", "recoveries",
              "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt",
              "next_pymnt_d", "last_credit_pull_d", "collections_12_mths_ex_med",
              "policy_code", "acc_now_delinq", "chargeoff_within_12_mths", "delinq_amnt",
              "tax_liens", "application_type", "pub_rec_bankruptcies", "addr_state"]
loan.drop(colsToDrop, axis=1, inplace=True)
print("Features we are left with", list(loan.columns))
```

We are left with ['loan_amnt', 'funded_amnt_inv', 'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'annual_inc', 'verification_status', 'issue_d', 'loan_status', 'purpose', 'dti', 'earliest_cr_line', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_util', 'total_acc']

Now, dealing with missing values by removing/imputing:

```python
# among 12 unique values of emp_length, 10+ years is the most frequent,
# but it is a highly dependent variable, so we will not impute;
# instead remove the rows with null values, which are around 2.5%
loan.dropna(axis=0, subset=["emp_length"], inplace=True)

# remove NA rows for revol_util as it is dependent and around 0.1%
loan.dropna(axis=0, subset=["revol_util"], inplace=True)
```

Now, we standardize some feature columns to make the data compatible for analysis:

```python
# update int_rate and revol_util without the % sign and as numeric type
loan["int_rate"] = pd.to_numeric(loan["int_rate"].apply(lambda x: x.split('%')[0]))
loan["revol_util"] = pd.to_numeric(loan["revol_util"].apply(lambda x: x.split('%')[0]))

# remove text data from term feature and store as numerical
loan["term"] = pd.to_numeric(loan["term"].apply(lambda x: x.split()[0]))
```
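The `%`-stripping and term conversions can be sanity-checked on a few hand-made values. A small sketch (the sample values are made up, only shaped like the real columns):

```python
import pandas as pd

# sample values shaped like the raw int_rate / term columns
int_rate = pd.Series(["10.65%", "15.27%", "13.49%"])
term = pd.Series([" 36 months", " 60 months", " 36 months"])

# split off the '%' sign / trailing text, then convert to numbers
int_rate_num = pd.to_numeric(int_rate.apply(lambda x: x.split('%')[0]))
term_num = pd.to_numeric(term.apply(lambda x: x.split()[0]))
print(int_rate_num.tolist())  # [10.65, 15.27, 13.49]
print(term_num.tolist())      # [36, 60, 36]
```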
Removing records with loan status "Current", as these loans are currently running and we cannot infer anything about default from them:

```python
# remove the rows with loan_status as "Current"
loan = loan[loan["loan_status"].apply(lambda x: False if x == "Current" else True)]

# update loan_status: "Fully Paid" to 0 and "Charged Off" to 1
loan["loan_status"] = loan["loan_status"].apply(lambda x: 0 if x == "Fully Paid" else 1)

# update emp_length feature with continuous values as int,
# where (< 1 year) is assumed as 0, 10+ years as 10, and the rest keep their magnitude
loan["emp_length"] = pd.to_numeric(loan["emp_length"].apply(
    lambda x: 0 if "<" in x else (x.split('+')[0] if "+" in x else x.split()[0])))

# look through the purpose value counts
loan_purpose_values = loan["purpose"].value_counts() * 100 / loan.shape[0]

# remove rows with less than 1% of value counts in a particular purpose
loan_purpose_delete = loan_purpose_values[loan_purpose_values < 1].index.values
loan = loan[[False if p in loan_purpose_delete else True for p in loan["purpose"]]]
```

Outlier Treatment

Looking at the quantile values of each feature, we treat outliers for some features:

```python
# for annual_inc, the highest value is 6000000 where the 75% quantile is 83000,
# about 100 times the mean; remove outliers from the 99 to 100% quantile range
annual_inc_q = loan["annual_inc"].quantile(0.99)
loan = loan[loan["annual_inc"] < annual_inc_q]

# for open_acc, the highest value is 44 where the 75% quantile is 12,
# about 5 times the mean; remove outliers from the 99.9 to 100% quantile range
open_acc_q = loan["open_acc"].quantile(0.999)
loan = loan[loan["open_acc"] < open_acc_q]

# for total_acc, the highest value is 90 where the 75% quantile is 29,
# about 4 times the mean; remove outliers from the 98 to 100% quantile range
total_acc_q = loan["total_acc"].quantile(0.98)
loan = loan[loan["total_acc"] < total_acc_q]

# for pub_rec, the highest value is 4 where the 75% quantile is 0;
# remove outliers from the 99.5 to 100% quantile range
pub_rec_q = loan["pub_rec"].quantile(0.995)
loan = loan[loan["pub_rec"] <= pub_rec_q]
```
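The quantile-based trimming used above is easy to verify on synthetic data. A sketch assuming a made-up income distribution with a few planted extreme values:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
# synthetic incomes mimicking annual_inc: mostly moderate values
# plus a handful of extreme outliers
income = pd.Series(np.concatenate([rng.normal(60000, 15000, 995),
                                   [6000000] * 5]))

# keep only rows strictly below the 99% quantile, as in the article
cutoff = income.quantile(0.99)
trimmed = income[income < cutoff]
print("rows kept:", len(trimmed))
print("max after trim:", round(trimmed.max(), 2))
```

About 1% of rows drop out, and the planted 6,000,000 outliers are gone, which is exactly the effect intended for `annual_inc`.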
Now this is how our data looks after cleaning and standardizing the features.

Metrics Derivation

The issue date is not in a standard format, and we can also split it into separate month and year columns, which will make the analysis easier. DateTime parsing requires the year to be between 00 and 99, but in some records the year is a single digit (e.g. 9), so we write a function that pads such dates to avoid exceptions during conversion.
```python
def standerdisedate(date):
    year = date.split("-")[0]
    if (len(year) == 1):
        date = "0" + date
    return date

from datetime import datetime
loan['issue_d'] = loan['issue_d'].apply(lambda x: standerdisedate(x))
loan['issue_d'] = loan['issue_d'].apply(lambda x: datetime.strptime(x, '%b-%y'))

# extracting month and year from issue_date
loan['month'] = loan['issue_d'].apply(lambda x: x.month)
loan['year'] = loan['issue_d'].apply(lambda x: x.year)

# get year from earliest_cr_line and replace the same
loan["earliest_cr_line"] = pd.to_numeric(loan["earliest_cr_line"].apply(lambda x: x.split('-')[1]))
```

Binning continuous features:

```python
# create bins for loan_amnt range
bins = [0, 5000, 10000, 15000, 20000, 25000, 36000]
bucket_l = ['0-5000', '5000-10000', '10000-15000', '15000-20000', '20000-25000', '25000+']
loan['loan_amnt_range'] = pd.cut(loan['loan_amnt'], bins, labels=bucket_l)

# create bins for int_rate range
bins = [0, 7.5, 10, 12.5, 15, 100]
bucket_l = ['0-7.5', '7.5-10', '10-12.5', '12.5-15', '15+']
loan['int_rate_range'] = pd.cut(loan['int_rate'], bins, labels=bucket_l)

# create bins for annual_inc range
bins = [0, 25000, 50000, 75000, 100000, 1000000]
bucket_l = ['0-25000', '25000-50000', '50000-75000', '75000-100000', '100000+']
loan['annual_inc_range'] = pd.cut(loan['annual_inc'], bins, labels=bucket_l)

# create bins for installment range
def installment(n):
    if n <= 200:
        return 'low'
    elif n > 200 and n <= 500:
        return 'medium'
    elif n > 500 and n <= 800:
        return 'high'
    else:
        return 'very high'

loan['installment'] = loan['installment'].apply(lambda x: installment(x))

# create bins for dti range
bins = [-1, 5.00, 10.00, 15.00, 20.00, 25.00, 50.00]
bucket_l = ['0-5%', '5-10%', '10-15%', '15-20%', '20-25%', '25%+']
loan['dti_range'] = pd.cut(loan['dti'], bins, labels=bucket_l)
```

The preceding bins are created.

Visualising Data Insights
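`pd.cut` assigns each value to the bucket whose interval contains it; intervals are right-closed by default, so a value equal to an edge (e.g. 10000) falls into the lower bucket. A tiny demo with the same loan_amnt edges and a few made-up amounts:

```python
import pandas as pd

# the same bin edges and labels used for loan_amnt above
bins = [0, 5000, 10000, 15000, 20000, 25000, 36000]
labels = ['0-5000', '5000-10000', '10000-15000',
          '15000-20000', '20000-25000', '25000+']

# made-up sample amounts; 10000 lands in '5000-10000' (right-closed)
amounts = pd.Series([2400, 10000, 15500, 35000])
ranges = pd.cut(amounts, bins, labels=labels)
print(ranges.tolist())
```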
```python
# check for the amount of defaults in the data using countplot
plt.figure(figsize=(14, 5))
sns.countplot(y="loan_status", data=loan)
plt.show()
```

From the above plot we can see that around 16%, i.e. 5,062 people out of 35,152 records, are defaulters.

Univariate Analysis

```python
# function for plotting the bar chart of a feature wrt default percentage
def plotUnivariateRatioBar(feature, data=loan, figsize=(10, 5), rsorted=True):
    plt.figure(figsize=figsize)
    if rsorted:
        feature_dimension = sorted(data[feature].unique())
    else:
        feature_dimension = data[feature].unique()
    feature_values = []
    for fd in feature_dimension:
        feature_filter = data[data[feature] == fd]
        feature_count = len(feature_filter[feature_filter["loan_status"] == 1])
        feature_values.append(feature_count * 100 / feature_filter["loan_status"].count())
    plt.bar(feature_dimension, feature_values, color='orange', edgecolor='white')
    plt.title("Loan Defaults wrt " + str(feature) + " feature - countplot")
    plt.xlabel(feature, fontsize=16)
    plt.ylabel("defaulter %", fontsize=16)
    plt.show()

# function to plot univariate with default status on a 0 - 1 scale
def plotUnivariateBar(x, figsize=(10, 5)):
    plt.figure(figsize=figsize)
    sns.barplot(x=x, y='loan_status', data=loan)
    plt.title("Loan Defaults wrt " + str(x) + " feature - countplot")
    plt.xlabel(x, fontsize=16)
    plt.ylabel("defaulter ratio", fontsize=16)
    plt.show()
```

a. Categorical Features

```python
# check for defaulters wrt term in the data using countplot
plotUnivariateBar("term", figsize=(8, 5))
```

From the above plot for 'term' we can infer that the defaulter rate increases with term; hence the chance of a loan defaulting is lower for 36 months than for 60 months.
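Since loan_status is encoded as 0/1, the per-category defaulter ratio that plotUnivariateRatioBar computes with an explicit loop is simply a groupby mean. A sketch on toy data (the values are made up for illustration):

```python
import pandas as pd

# toy data: loan_status already encoded as 0 = Fully Paid, 1 = Charged Off
toy = pd.DataFrame({
    "term": [36, 36, 36, 60, 60],
    "loan_status": [0, 0, 1, 1, 1],
})

# the mean of a 0/1 column per group is exactly the default ratio
default_ratio = toy.groupby("term")["loan_status"].mean()
print(default_ratio.to_dict())
```

This is also what seaborn's `barplot` with `y='loan_status'` estimates per category, which is why the two plotting helpers agree up to a factor of 100.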
Is term beneficial -> Yes

```python
# check for defaulters wrt grade in the data using countplot
plotUnivariateRatioBar("grade")
```

From the above plot for 'grade' we can infer that the defaulter rate increases with grade; hence the chance of a loan defaulting increases from grade A moving towards G.

Is grade beneficial -> Yes

```python
# check for defaulters wrt sub_grade in the data using countplot
plotUnivariateBar("sub_grade", figsize=(16, 5))
```

From the above plot for 'sub_grade' we can infer that the defaulter rate increases with sub_grade; hence the chance of a loan defaulting increases from A1 moving towards G5.

Is sub_grade beneficial -> Yes

```python
# check for defaulters wrt home_ownership in the data
plotUnivariateRatioBar("home_ownership")
```

From the above plot for 'home_ownership' we can infer that the defaulter rate is roughly constant (it is somewhat higher for OTHER, but we don't know what falls in that category, so we will not consider it for analysis); hence default does not depend on home_ownership.

Is home_ownership beneficial -> No

```python
# check for defaulters wrt verification_status in the data
plotUnivariateRatioBar("verification_status")
```

From the above plot for 'verification_status' we can infer that the defaulter rate is increasing and is lower for Not Verified users than for Verified ones, but it is not useful for analysis.
Is verification_status beneficial -> No

```python
# check for defaulters wrt purpose in the data using countplot
plotUnivariateBar("purpose", figsize=(16, 6))
```

From the above plot for 'purpose' we can infer that the defaulter rate is nearly constant for every purpose type except 'small business'; hence the rate depends on the purpose of the loan.

Is purpose beneficial -> Yes

```python
# check for defaulters wrt open_acc in the data using countplot
plotUnivariateRatioBar("open_acc", figsize=(16, 6))
```

From the above plot for 'open_acc' we can infer that the defaulter rate is nearly constant; hence the rate does not depend on the open_acc feature.

Is open_acc beneficial -> No

```python
# check for defaulters wrt pub_rec in the data using countplot
plotUnivariateRatioBar("pub_rec")
```

From the above plot for 'pub_rec' we can infer that the defaulter rate tends to increase, being lower for 0 and higher for pub_rec value 1; but as the non-zero values are very rare compared to 0, we will not consider this.

Is pub_rec beneficial -> No
b. Continuous Features

```python
# check for defaulters wrt emp_length in the data using countplot
plotUnivariateBar("emp_length", figsize=(14, 6))
```

From the above plot for 'emp_length' we can infer that the defaulter rate is constant; hence default does not depend on emp_length.

Is emp_length beneficial -> No

```python
# check for defaulters wrt month in the data using countplot
plotUnivariateBar("month", figsize=(14, 6))
```

From the above plot for 'month' we can infer that the defaulter rate is nearly constant; not useful.

Is month beneficial -> No

```python
# check for defaulters wrt year in the data using countplot
plotUnivariateBar("year")
```

From the above plot for 'year' we can infer that the defaulter rate is nearly constant; not useful.

Is year beneficial -> No

```python
# check for defaulters wrt earliest_cr_line in the data
plotUnivariateBar("earliest_cr_line", figsize=(16, 10))
```

From the above plot for 'earliest_cr_line' we can infer that the defaulter rate is nearly constant except for years around 65; hence the rate does not depend on the earliest_cr_line of the person.

Is earliest_cr_line beneficial -> No

```python
# check for defaulters wrt inq_last_6mths in the data
plotUnivariateBar("inq_last_6mths")
```

From the above plot for 'inq_last_6mths' we can infer that the defaulter rate does not consistently increase with inq_last_6mths; hence not useful.

Is inq_last_6mths beneficial -> No

```python
# check for defaulters wrt revol_util in the data using countplot
plotUnivariateRatioBar("revol_util", figsize=(16, 6))
```

From the above plot for 'revol_util' we can infer that the defaulter rate fluctuates, with some values showing a 100% defaulter ratio, and generally increases with magnitude; hence the rate depends on the revol_util feature.

Is revol_util beneficial -> Yes

```python
# check for defaulters wrt total_acc in the data using countplot
plotUnivariateRatioBar("total_acc", figsize=(14, 6))
```

From the above plot for 'total_acc' we can infer that the defaulter rate is nearly constant for all total_acc values; hence the rate does not depend on the total_acc feature.

Is total_acc beneficial -> No
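Trends read off these ratio plots can also be checked numerically by bucketing the feature and averaging loan_status per bucket. A sketch with made-up revol_util values (the bucket edges here are illustrative, not the article's):

```python
import pandas as pd

# toy revol_util values with 0/1 loan_status: checks the
# "default rate rises with utilisation" reading numerically
toy = pd.DataFrame({
    "revol_util": [5, 15, 25, 55, 65, 95, 98, 99],
    "loan_status": [0, 0, 0, 1, 0, 1, 1, 1],
})
# hypothetical coarse buckets for illustration
toy["util_bucket"] = pd.cut(toy["revol_util"], [0, 30, 70, 100],
                            labels=["low", "mid", "high"])
rate = toy.groupby("util_bucket", observed=True)["loan_status"].mean()
print(rate.to_dict())
```

A monotonically increasing `rate` across buckets backs up the visual impression from the bar chart.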
```python
# check for defaulters wrt loan_amnt_range in the data using countplot
plotUnivariateBar("loan_amnt_range")
```

From the above plot for 'loan_amnt_range' we can infer that the defaulter rate increases with loan_amnt_range values; hence the rate depends on the loan_amnt_range feature.

Is loan_amnt_range beneficial -> Yes

```python
# check for defaulters wrt int_rate_range in the data
plotUnivariateBar("int_rate_range")
```

From the above plot for 'int_rate_range' we can infer that the defaulter rate is decreasing with int_rate_range values; hence the rate depends on the int_rate_range feature.

Is int_rate_range beneficial -> Yes

```python
# check for defaulters wrt annual_inc_range in the data
plotUnivariateBar("annual_inc_range")
```

From the above plot for 'annual_inc_range' we can infer that the defaulter rate is decreasing with annual_inc_range values; hence the rate depends on the annual_inc_range feature.

Is annual_inc_range beneficial -> Yes

```python
# check for defaulters wrt dti_range in the data using countplot
plotUnivariateBar("dti_range", figsize=(16, 5))
```

From the above plot for 'dti_range' we can infer that the defaulter rate is increasing with dti_range values; hence the rate depends on the dti_range feature.

Is dti_range beneficial -> Yes

```python
# check for defaulters wrt installment range in the data
plotUnivariateBar("installment", figsize=(8, 5))
```

From the above plot for 'installment' we can infer that the defaulter rate is increasing with installment values; hence the rate depends on the installment feature.

Is installment beneficial -> Yes

Therefore, the following are the important features deduced from the above univariate analysis: term, grade, purpose, pub_rec, revol_util, funded_amnt_inv, int_rate, annual_inc, dti, installment

Bivariate Analysis
```python
# function to plot scatter plot for two features
def plotScatter(x, y):
    plt.figure(figsize=(16, 6))
    sns.scatterplot(x=x, y=y, hue="loan_status", data=loan)
    plt.title("Scatter plot between " + x + " and " + y)
    plt.xlabel(x, fontsize=16)
    plt.ylabel(y, fontsize=16)
    plt.show()

# function to plot the default ratio for one feature grouped by another
def plotBivariateBar(x, hue, figsize=(16, 6)):
    plt.figure(figsize=figsize)
    sns.barplot(x=x, y='loan_status', hue=hue, data=loan)
    plt.title("Loan Default ratio wrt " + x + " feature for hue " + hue + " in the data using countplot")
    plt.xlabel(x, fontsize=16)
    plt.ylabel("defaulter ratio", fontsize=16)
    plt.show()
```

Plotting two different features against the loan default ratio on the y-axis with bar plots and scatter plots.

```python
# check for defaulters wrt annual_inc_range and purpose in the data using countplot
plotBivariateBar("annual_inc_range", "purpose")
```

From the above plot, we can infer that it does not show any correlation. Related -> N

```python
# check for defaulters wrt term and purpose in the data
plotBivariateBar("term", "purpose")
```

From the above plot, the default ratio increases for every purpose wrt term. Related -> Y

```python
# check for defaulters wrt grade and purpose in the data
plotBivariateBar("grade", "purpose")
```

From the above plot, the default ratio increases for every purpose wrt grade. Related -> Y

```python
# check for defaulters wrt loan_amnt_range and purpose in the data
plotBivariateBar("loan_amnt_range", "purpose")
```

From the above plot, the default ratio increases for every purpose wrt loan_amnt_range. Related -> Y

```python
# check for defaulters wrt loan_amnt_range and term in the data
plotBivariateBar("loan_amnt_range", "term")
```

From the above plot, the default ratio increases for every term wrt loan_amnt_range. Related -> Y

```python
# check for defaulters wrt annual_inc_range and purpose in the data
plotBivariateBar("annual_inc_range", "purpose")
```

From the above plot, the default ratio increases for every purpose wrt annual_inc_range. Related -> Y

```python
# check for defaulters wrt installment and purpose in the data
plotBivariateBar("installment", "purpose")
```

From the above plot, the default ratio increases for every purpose wrt installment, except for small_business. Related -> Y
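The grouped bars drawn by plotBivariateBar correspond to a two-way table of default ratios, which `pivot_table` can produce directly. A sketch on toy data (values invented for illustration):

```python
import pandas as pd

# toy frame: default ratio by term within each purpose, the same
# numbers plotBivariateBar renders as grouped bars
toy = pd.DataFrame({
    "purpose": ["car", "car", "car", "wedding", "wedding", "wedding"],
    "term": [36, 60, 60, 36, 36, 60],
    "loan_status": [0, 1, 1, 0, 1, 1],
})
pivot = toy.pivot_table(index="purpose", columns="term",
                        values="loan_status", aggfunc="mean")
print(pivot)
```

Each cell is the default ratio for one (purpose, term) combination, so a column that dominates across every row is the numeric counterpart of the "straight lines" pattern described above.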
```python
# plot scatter for int_rate with annual_inc
plotScatter("int_rate", "annual_inc")
```

As we can see straight-line patterns on the plot, there is no relation between the above features. Related -> N

```python
# plot scatter for funded_amnt_inv with dti
plotScatter("funded_amnt_inv", "dti")
```

As we can see straight-line patterns on the plot, there is no relation between the above features. Related -> N

```python
# plot scatter for funded_amnt_inv with annual_inc
plotScatter("annual_inc", "funded_amnt_inv")
```

As we can see a slope pattern on the plot, there is a positive relation between the above features. Related -> Y

```python
# plot scatter for loan_amnt with int_rate
plotScatter("loan_amnt", "int_rate")
```

As we can see straight-line patterns on the plot, there is no relation between the above features. Related -> N

```python
# plot scatter for int_rate with annual_inc
plotScatter("int_rate", "annual_inc")
```

As we can see a negative correlation pattern with reducing density on the plot, there is some relation between the above features. Related -> Y

```python
# plot scatter for earliest_cr_line with int_rate
plotScatter("earliest_cr_line", "int_rate")
```

As we can see a positive correlation pattern with increasing density on the plot, there is a correlation between the above features. Related -> Y

```python
# plot scatter for annual_inc with emp_length
plotScatter("annual_inc", "emp_length")
```

As we can see straight-line patterns on the plot, there is no relation between the above features. Related -> N

```python
# plot scatter for earliest_cr_line with dti
plotScatter("earliest_cr_line", "dti")
```

Plotting two different features against the loan default ratio with box plots and violin plots.
```python
# function to plot boxplot and violinplot for comparing two features
def plotBox(x, y, hue="loan_status"):
    plt.figure(figsize=(16, 6))
    sns.boxplot(x=x, y=y, data=loan, hue=hue, order=sorted(loan[x].unique()))
    plt.title("Box plot between " + x + " and " + y + " for each " + hue)
    plt.xlabel(x, fontsize=16)
    plt.ylabel(y, fontsize=16)
    plt.show()
    plt.figure(figsize=(16, 8))
    sns.violinplot(x=x, y=y, data=loan, hue=hue, order=sorted(loan[x].unique()))
    plt.title("Violin plot between " + x + " and " + y + " for each " + hue)
    plt.xlabel(x, fontsize=16)
    plt.ylabel(y, fontsize=16)
    plt.show()
```

```python
# plot box for term vs int_rate for each loan_status
plotBox("term", "int_rate")
```

int_rate increases with term on the loan, and the chance of default increases with it.

```python
# plot box for loan_status vs int_rate for each purpose
plotBox("loan_status", "int_rate", hue="purpose")
```

int_rate is quite high where the loan defaults, for every purpose value.

```python
# plot box for purpose vs revol_util for each status
plotBox("purpose", "revol_util")
```

revol_util is higher for every purpose value where the loan defaulted, and quite high for credit_card.

```python
# plot box for grade vs int_rate for each loan_status
plotBox("grade", "int_rate", "loan_status")
```

int_rate increases with every grade, and for every grade the defaulters' median sits near the non-defaulters' 75% quantile of int_rate.

```python
# plot box for issue_d month vs int_rate for each loan_status
plotBox("month", "int_rate", "loan_status")
```

int_rate for defaulters is higher in every month, with the defaulters' median near the non-defaulters' 75% quantile of int_rate, but it is almost constant across months; not useful.

Therefore, the following are the important features deduced from the above bivariate analysis: term, grade, purpose, pub_rec, revol_util, funded_amnt_inv, int_rate, annual_inc, installment

Multivariate Analysis (Correlation)
```python
# plot heat map to see correlation between features
continuous_f = ["funded_amnt_inv", "annual_inc", "term", "int_rate",
                "loan_status", "revol_util", "pub_rec", "earliest_cr_line"]
loan_corr = loan[continuous_f].corr()
sns.heatmap(loan_corr, vmin=-1.0, vmax=1.0, annot=True, cmap="YlGnBu")
plt.title("Correlation Heatmap")
plt.show()
```

Hence, the important related features from the above multivariate analysis are: term, grade, purpose, revol_util, int_rate, installment, annual_inc, funded_amnt_inv

Final Findings

After analysing all the related features available in the dataset, we come to an end, having deduced the main driving features for the Lending Club loan default analysis.

The best driving features for loan default analysis are: term, grade, purpose, revol_util, int_rate, installment, annual_inc, funded_amnt_inv

Also published at https://towardsdatascience.com/insightful-loan-default-analysis-in-lending-credit-risk-model-b16bbfc94a2f
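Reading the heatmap by eye can be complemented by ranking feature pairs by absolute correlation. A sketch on synthetic data (the correlated pair here is constructed for illustration, not taken from the real dataset):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
# hypothetical features: "term" is built to correlate with "int_rate",
# "annual_inc" is independent noise
df = pd.DataFrame({
    "int_rate": x,
    "term": x * 0.8 + rng.normal(scale=0.5, size=200),
    "annual_inc": rng.normal(size=200),
})
corr = df.corr()
# keep only the upper triangle so each pair appears once,
# then stack into (feature, feature) -> |r| and sort
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().abs().sort_values(ascending=False)
print(pairs.head(3))
```

The top of this ranking is the numeric equivalent of the darkest off-diagonal cells in the heatmap.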