A Step-by-Step Guide (With a Healthy Dose of Data Cleaning)

Contents:
• Introduction
• Data cleaning
• Building the neural networks
• Saving the final model
• Building the API

Introduction

LendingClub is the world’s largest peer-to-peer lending platform. Until recently (through the end of 2018), LendingClub published a public dataset of all loans issued since the company’s launch in 2007. I’m accessing the dataset via Kaggle.

import pandas as pd

loans_raw = pd.read_csv(
    "../input/lending-club/accepted_2007_to_2018q4.csv/accepted_2007_to_2018Q4.csv",
    low_memory=False,
)
loans_raw.shape

(2260701, 151)

With 2,260,701 loans to look at and 151 potential variables, my goal is to create a neural network model to predict the fraction of an expected loan return that a prospective borrower will pay back. Afterward, I’ll create a public API to serve that model.

Also, as you may have guessed from the preceding code block, this post is adapted from a Jupyter Notebook. If you’d like to follow along in your own notebook, go ahead and fork mine on Kaggle or GitHub.

Data cleaning

I’ll first look at the data dictionary (downloaded directly from LendingClub’s website) to get an idea of how to create the desired output variable and which remaining features are available at the point of loan application (to avoid data leakage).

dictionary_df = pd.read_excel("https://resources.lendingclub.com/LCDataDictionary.xlsx")

# Drop blank rows, strip white space, convert to Python dictionary, fix one key name
dictionary_df.dropna(axis="index", inplace=True)
dictionary_df = dictionary_df.applymap(lambda x: x.strip())
dictionary_df.set_index("LoanStatNew", inplace=True)
dictionary = dictionary_df["Description"].to_dict()
dictionary["verification_status_joint"] = dictionary.pop("verified_status_joint")

# Print in order of dataset columns (which makes more sense than dictionary's order)
for col in loans_raw.columns:
    print(f"•{col}: {dictionary[col]}")

•id: A unique LC assigned ID for the loan listing.
•member_id: A unique LC assigned Id for the borrower member.
•loan_amnt: The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.
•funded_amnt: The total amount committed to that loan at that point in time.
•funded_amnt_inv: The total amount committed by investors for that loan at that point in time.
•term: The number of payments on the loan. Values are in months and can be either 36 or 60.
•int_rate: Interest Rate on the loan
•installment: The monthly payment owed by the borrower if the loan originates.
•grade: LC assigned loan grade
•sub_grade: LC assigned loan subgrade
•emp_title: The job title supplied by the Borrower when applying for the loan.*
•emp_length: Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.
•home_ownership: The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER
•annual_inc: The self-reported annual income provided by the borrower during registration.
•verification_status: Indicates if income was verified by LC, not verified, or if the income source was verified
•issue_d: The month which the loan was funded
•loan_status: Current status of the loan
•pymnt_plan: Indicates if a payment plan has been put in place for the loan
•url: URL for the LC page with listing data.
•desc: Loan description provided by the borrower
•purpose: A category provided by the borrower for the loan request.
•title: The loan title provided by the borrower
•zip_code: The first 3 numbers of the zip code provided by the borrower in the loan application.
•addr_state: The state provided by the borrower in the loan application
•dti: A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
•delinq_2yrs: The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years
•earliest_cr_line: The month the borrower's earliest reported credit line was opened
•fico_range_low: The lower boundary range the borrower’s FICO at loan origination belongs to.
•fico_range_high: The upper boundary range the borrower’s FICO at loan origination belongs to.
•inq_last_6mths: The number of inquiries in past 6 months (excluding auto and mortgage inquiries)
•mths_since_last_delinq: The number of months since the borrower's last delinquency.
•mths_since_last_record: The number of months since the last public record.
•open_acc: The number of open credit lines in the borrower's credit file.
•pub_rec: Number of derogatory public records
•revol_bal: Total credit revolving balance
•revol_util: Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.
•total_acc: The total number of credit lines currently in the borrower's credit file
•initial_list_status: The initial listing status of the loan. Possible values are – W, F
•out_prncp: Remaining outstanding principal for total amount funded
•out_prncp_inv: Remaining outstanding principal for portion of total amount funded by investors
•total_pymnt: Payments received to date for total amount funded
•total_pymnt_inv: Payments received to date for portion of total amount funded by investors
•total_rec_prncp: Principal received to date
•total_rec_int: Interest received to date
•total_rec_late_fee: Late fees received to date
•recoveries: post charge off gross recovery
•collection_recovery_fee: post charge off collection fee
•last_pymnt_d: Last month payment was received
•last_pymnt_amnt: Last total payment amount received
•next_pymnt_d: Next scheduled payment date
•last_credit_pull_d: The most recent month LC pulled credit for this loan
•last_fico_range_high: The upper boundary range the borrower’s last FICO pulled belongs to.
•last_fico_range_low: The lower boundary range the borrower’s last FICO pulled belongs to.
•collections_12_mths_ex_med: Number of collections in 12 months excluding medical collections
•mths_since_last_major_derog: Months since most recent 90-day or worse rating
•policy_code: publicly available policy_code=1, new products not publicly available policy_code=2
•application_type: Indicates whether the loan is an individual application or a joint application with two co-borrowers
•annual_inc_joint: The combined self-reported annual income provided by the co-borrowers during registration
•dti_joint: A ratio calculated using the co-borrowers' total monthly payments on the total debt obligations, excluding mortgages and the requested LC loan, divided by the co-borrowers' combined self-reported monthly income
•verification_status_joint: Indicates if the co-borrowers' joint income was verified by LC, not verified, or if the income source was verified
•acc_now_delinq: The number of accounts on which the borrower is now delinquent.
•tot_coll_amt: Total collection amounts ever owed
•tot_cur_bal: Total current balance of all accounts
•open_acc_6m: Number of open trades in last 6 months
•open_act_il: Number of currently active installment trades
•open_il_12m: Number of installment accounts opened in past 12 months
•open_il_24m: Number of installment accounts opened in past 24 months
•mths_since_rcnt_il: Months since most recent installment accounts opened
•total_bal_il: Total current balance of all installment accounts
•il_util: Ratio of total current balance to high credit/credit limit on all install acct
•open_rv_12m: Number of revolving trades opened in past 12 months
•open_rv_24m: Number of revolving trades opened in past 24 months
•max_bal_bc: Maximum current balance owed on all revolving accounts
•all_util: Balance to credit limit on all trades
•total_rev_hi_lim: Total revolving high credit/credit limit
•inq_fi: Number of personal finance inquiries
•total_cu_tl: Number of finance trades
•inq_last_12m: Number of credit inquiries in past 12 months
•acc_open_past_24mths: Number of trades opened in past 24 months.
•avg_cur_bal: Average current balance of all accounts
•bc_open_to_buy: Total open to buy on revolving bankcards.
•bc_util: Ratio of total current balance to high credit/credit limit for all bankcard accounts.
•chargeoff_within_12_mths: Number of charge-offs within 12 months
•delinq_amnt: The past-due amount owed for the accounts on which the borrower is now delinquent.
•mo_sin_old_il_acct: Months since oldest bank installment account opened
•mo_sin_old_rev_tl_op: Months since oldest revolving account opened
•mo_sin_rcnt_rev_tl_op: Months since most recent revolving account opened
•mo_sin_rcnt_tl: Months since most recent account opened
•mort_acc: Number of mortgage accounts.
•mths_since_recent_bc: Months since most recent bankcard account opened.
•mths_since_recent_bc_dlq: Months since most recent bankcard delinquency
•mths_since_recent_inq: Months since most recent inquiry.
•mths_since_recent_revol_delinq: Months since most recent revolving delinquency.
•num_accts_ever_120_pd: Number of accounts ever 120 or more days past due
•num_actv_bc_tl: Number of currently active bankcard accounts
•num_actv_rev_tl: Number of currently active revolving trades
•num_bc_sats: Number of satisfactory bankcard accounts
•num_bc_tl: Number of bankcard accounts
•num_il_tl: Number of installment accounts
•num_op_rev_tl: Number of open revolving accounts
•num_rev_accts: Number of revolving accounts
•num_rev_tl_bal_gt_0: Number of revolving trades with balance > 0
•num_sats: Number of satisfactory accounts
•num_tl_120dpd_2m: Number of accounts currently 120 days past due (updated in past 2 months)
•num_tl_30dpd: Number of accounts currently 30 days past due (updated in past 2 months)
•num_tl_90g_dpd_24m: Number of accounts 90 or more days past due in last 24 months
•num_tl_op_past_12m: Number of accounts opened in past 12 months
•pct_tl_nvr_dlq: Percent of trades never delinquent
•percent_bc_gt_75: Percentage of all bankcard accounts > 75% of limit.
•pub_rec_bankruptcies: Number of public record bankruptcies
•tax_liens: Number of tax liens
•tot_hi_cred_lim: Total high credit/credit limit
•total_bal_ex_mort: Total credit balance excluding mortgage
•total_bc_limit: Total bankcard high credit/credit limit
•total_il_high_credit_limit: Total installment high credit/credit limit
•revol_bal_joint: Sum of revolving credit balance of the co-borrowers, net of duplicate balances
•sec_app_fico_range_low: FICO range (high) for the secondary applicant
•sec_app_fico_range_high: FICO range (low) for the secondary applicant
•sec_app_earliest_cr_line: Earliest credit line at time of application for the secondary applicant
•sec_app_inq_last_6mths: Credit inquiries in the last 6 months at time of application for the secondary applicant
•sec_app_mort_acc: Number of mortgage accounts at time of application for the secondary applicant
•sec_app_open_acc: Number of open trades at time of application for the secondary applicant
•sec_app_revol_util: Ratio of total current balance to high credit/credit limit for all revolving accounts
•sec_app_open_act_il: Number of currently active installment trades at time of application for the secondary applicant
•sec_app_num_rev_accts: Number of revolving accounts at time of application for the secondary applicant
•sec_app_chargeoff_within_12_mths: Number of charge-offs within last 12 months at time of application for the secondary applicant
•sec_app_collections_12_mths_ex_med: Number of collections within last 12 months excluding medical collections at time of application for the secondary applicant
•sec_app_mths_since_last_major_derog: Months since most recent 90-day or worse rating at time of application for the secondary applicant
•hardship_flag: Flags whether or not the borrower is on a hardship plan
•hardship_type: Describes the hardship plan offering
•hardship_reason: Describes the reason the hardship plan was offered
•hardship_status: Describes if the hardship plan is active, pending, canceled, completed, or broken
•deferral_term: Amount of months that the borrower is expected to pay less than the contractual monthly payment amount due to a hardship plan
•hardship_amount: The interest payment that the borrower has committed to make each month while they are on a hardship plan
•hardship_start_date: The start date of the hardship plan period
•hardship_end_date: The end date of the hardship plan period
•payment_plan_start_date: The day the first hardship plan payment is due. For example, if a borrower has a hardship plan period of 3 months, the start date is the start of the three-month period in which the borrower is allowed to make interest-only payments.
•hardship_length: The number of months the borrower will make smaller payments than normally obligated due to a hardship plan
•hardship_dpd: Account days past due as of the hardship plan start date
•hardship_loan_status: Loan Status as of the hardship plan start date
•orig_projected_additional_accrued_interest: The original projected additional interest amount that will accrue for the given hardship payment plan as of the Hardship Start Date. This field will be null if the borrower has broken their hardship payment plan.
•hardship_payoff_balance_amount: The payoff balance amount as of the hardship plan start date
•hardship_last_payment_amount: The last payment amount as of the hardship plan start date
•disbursement_method: The method by which the borrower receives their loan. Possible values are: CASH, DIRECT_PAY
•debt_settlement_flag: Flags whether or not the borrower, who has charged-off, is working with a debt-settlement company.
•debt_settlement_flag_date: The most recent date that the Debt_Settlement_Flag has been set
•settlement_status: The status of the borrower’s settlement plan. Possible values are: COMPLETE, ACTIVE, BROKEN, CANCELLED, DENIED, DRAFT
•settlement_date: The date that the borrower agrees to the settlement plan
•settlement_amount: The loan amount that the borrower has agreed to settle for
•settlement_percentage: The settlement amount as a percentage of the payoff balance amount on the loan
•settlement_term: The number of months that the borrower will be on the settlement plan

For the output variable (the fraction of expected return that was recovered), I’ll calculate the expected return by multiplying the monthly payment amount (installment) by the number of payments on the loan (term), and I’ll calculate the amount actually received by summing the total principal, interest, late fees, and post-chargeoff gross recovery received (total_rec_prncp, total_rec_int, total_rec_late_fee, recoveries) and subtracting any collection fee (collection_recovery_fee).

cols_for_output = [
    "term",
    "installment",
    "total_rec_prncp",
    "total_rec_int",
    "total_rec_late_fee",
    "recoveries",
    "collection_recovery_fee",
]

Several other columns contain either irrelevant demographic data or data not created until after a loan is accepted, so those will need to be removed. I’ll hold onto issue_d (the month and year the loan was funded) for now, though, in case I want to compare variables to the date of the loan.

emp_title (the applicant’s job title) does seem relevant in the context of a loan, but it may have too many unique values to be useful.

loans_raw["emp_title"].nunique()

512694

Too many unique values indeed. In a future version of this model I could perhaps try to generate a feature from this column by aggregating job titles into categories, but that effort may have a low return on investment, since there are already columns for annual income and length of employment.
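If I did go down that road, the rough idea would look something like the sketch below. The categories and keywords here are hypothetical, purely to illustrate the approach, not anything derived from the data.

# Hypothetical sketch only: collapse free-text job titles into a few broad buckets.
# The category keywords are made up for illustration.
title_categories = {
    "medical": ["nurse", "physician", "doctor", "medical"],
    "education": ["teacher", "professor", "school"],
    "management": ["manager", "director", "supervisor"],
}


def categorize_title(title):
    title = str(title).lower()
    for category, keywords in title_categories.items():
        if any(keyword in title for keyword in keywords):
            return category
    return "other"


print(categorize_title("Registered Nurse"))  # medical
print(categorize_title("Project Manager"))   # management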
Two other interesting columns that I’ll also remove are title and desc (“description”), which are both freeform text entries written by the borrower. These could be fascinating subjects for natural language processing, but that’s outside the scope of the current project. Perhaps in the future I could generate additional features from these fields using measures like syntactic complexity, word count, or keyword inclusion.

cols_to_drop = [
    "id", "member_id", "funded_amnt", "funded_amnt_inv", "int_rate", "grade",
    "sub_grade", "emp_title", "pymnt_plan", "url", "desc", "title", "zip_code",
    "addr_state", "initial_list_status", "out_prncp", "out_prncp_inv",
    "total_pymnt", "total_pymnt_inv", "last_pymnt_d", "last_pymnt_amnt",
    "next_pymnt_d", "last_credit_pull_d", "last_fico_range_high",
    "last_fico_range_low", "policy_code", "hardship_flag", "hardship_type",
    "hardship_reason", "hardship_status", "deferral_term", "hardship_amount",
    "hardship_start_date", "hardship_end_date", "payment_plan_start_date",
    "hardship_length", "hardship_dpd", "hardship_loan_status",
    "orig_projected_additional_accrued_interest",
    "hardship_payoff_balance_amount", "hardship_last_payment_amount",
    "disbursement_method", "debt_settlement_flag", "debt_settlement_flag_date",
    "settlement_status", "settlement_date", "settlement_amount",
    "settlement_percentage", "settlement_term",
]
loans = loans_raw.drop(columns=cols_to_drop)

Before creating the output variable, however, I must take a closer look at loan_status, to see if any loans in the dataset are still open.

loans.groupby("loan_status")["loan_status"].count()

loan_status
Charged Off                                             268559
Current                                                 878317
Default                                                     40
Does not meet the credit policy. Status:Charged Off        761
Does not meet the credit policy. Status:Fully Paid        1988
Fully Paid                                             1076751
In Grace Period                                           8436
Late (16-30 days)                                         4349
Late (31-120 days)                                       21467
Name: loan_status, dtype: int64

For practical purposes, I’ll consider loans with statuses that don’t contain “Fully Paid” or “Charged Off” to still be open, so I’ll remove those from the dataset. I’ll also merge the “Does not meet the credit policy” statuses into their matching plain statuses.

credit_policy = "Does not meet the credit policy. Status:"
len_credit_policy = len(credit_policy)
remove_credit_policy = (
    lambda status: status[len_credit_policy:]
    if credit_policy in str(status)
    else status
)
loans["loan_status"] = loans["loan_status"].map(remove_credit_policy)

rows_to_drop = loans[
    (loans["loan_status"] != "Charged Off") & (loans["loan_status"] != "Fully Paid")
].index
loans.drop(index=rows_to_drop, inplace=True)

loans.groupby("loan_status")["loan_status"].count()

loan_status
Charged Off     269320
Fully Paid     1078739
Name: loan_status, dtype: int64

Now to create the output variable. I’ll start by checking the null counts of the variables involved.

loans[cols_for_output].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1348059 entries, 0 to 2260697
Data columns (total 7 columns):
 #   Column                   Non-Null Count    Dtype
---  ------                   --------------    -----
 0   term                     1348059 non-null  object
 1   installment              1348059 non-null  float64
 2   total_rec_prncp          1348059 non-null  float64
 3   total_rec_int            1348059 non-null  float64
 4   total_rec_late_fee       1348059 non-null  float64
 5   recoveries               1348059 non-null  float64
 6   collection_recovery_fee  1348059 non-null  float64
dtypes: float64(6), object(1)
memory usage: 82.3+ MB

Every remaining row has each of these seven variables, but term’s data type is object, so that needs to be fixed first.
loans.groupby("term")["term"].count()

term
 36 months    1023181
 60 months     324878
Name: term, dtype: int64

Ah, so term is a categorical feature with two options. I’ll treat it as such when I use it as an input to the model, but to calculate the output variable I’ll create a numerical column from it. Also, I need to trim the whitespace from the beginning of those values—that’s no good.

onehot_cols = ["term"]
loans["term"] = loans["term"].map(lambda term_str: term_str.strip())

extract_num = lambda term_str: float(term_str[:2])
loans["term_num"] = loans["term"].map(extract_num)

cols_for_output.remove("term")
cols_for_output.append("term_num")

Now I can create the output variable.

received = (
    loans["total_rec_prncp"]
    + loans["total_rec_int"]
    + loans["total_rec_late_fee"]
    + loans["recoveries"]
    - loans["collection_recovery_fee"]
)
expected = loans["installment"] * loans["term_num"]
loans["fraction_recovered"] = received / expected
loans.groupby("loan_status")["fraction_recovered"].describe()

                 count      mean       std       min       25%       50%       75%        max
loan_status
Charged Off   269320.0  0.400162  0.219020  0.000000  0.224463  0.367554  0.550924   2.410680
Fully Paid   1078739.0  0.932705  0.100455  0.506053  0.897912  0.960100  0.997612  60.932353

There is at least one odd outlier on the right in both categories. But also, many of the “fully paid” loans do not quite reach 1. One potential explanation is that when the last payment comes in, the system just flips loan_status to “Fully Paid” without adding the payment amount to the system itself, or perhaps simply multiplying installment by the term number leaves off a few cents in the actual total.

If I were performing this analysis for LendingClub themselves, I’d ask them, but this is just a personal project. I’ll consider every loan marked “Fully Paid” to have fully recovered the expected return. For that matter, I’ll cap my fraction_recovered values for charged-off loans at 1.0 as well, since at least one value is above that for some reason.

import numpy as np

loans["fraction_recovered"] = np.where(
    (loans["loan_status"] == "Fully Paid") | (loans["fraction_recovered"] > 1.0),
    1.0,
    loans["fraction_recovered"],
)
loans.groupby("loan_status")["fraction_recovered"].describe()

                 count      mean       std  min       25%       50%       75%  max
loan_status
Charged Off   269320.0  0.400152  0.218971  0.0  0.224463  0.367554  0.550924  1.0
Fully Paid   1078739.0  1.000000  0.000000  1.0  1.000000  1.000000  1.000000  1.0

For the sake of curiosity, I’ll plot the distribution of fraction recovered for charged-off loans.

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

sns.kdeplot(
    data=loans["fraction_recovered"][loans["loan_status"] == "Charged Off"],
    label="Charged Off",
    shade=True,
)
plt.axis(xmin=0, xmax=1)
plt.title('Distribution of "fraction recovered"')
plt.show()
Now that the output is formatted, it’s time to clean up the inputs. I’ll check the null counts of each variable.

loans.drop(columns=cols_for_output, inplace=True)
loans.info(verbose=True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1348059 entries, 0 to 2260697
Data columns (total 97 columns)
[Output abridged: most columns have 1,277,783–1,348,059 non-null values. The "months since" delinquency/derogatory columns are much sparser (e.g., mths_since_last_record 229,415, mths_since_last_major_derog 353,750, mths_since_last_delinq 668,117, mths_since_recent_inq 1,171,239). The joint-application columns have about 25,800 non-null values, the sec_app_* columns about 18,630, and a block of 14 credit history–related columns about 537,598 (il_util only 465,016).]
dtypes: float64(86), object(11)
memory usage: 1007.9+ MB

Remaining columns with lots of null values seem to fall into three categories:

1. Derogatory/delinquency metrics (where null means the borrower doesn’t have any such marks). I’ll also add mths_since_recent_inq to this list, since its non-null count is below what seems to be the threshold for complete data, which is around 1,277,783. I’ll assume a null value here means no recent inquiries.
2. Metrics that only apply to joint applications (where null means it was a single application).
3. An inexplicable series of 14 credit history–related columns that only have around 537,000 entries. Are these newer metrics?

negative_mark_cols = [
    "mths_since_last_delinq",
    "mths_since_last_record",
    "mths_since_last_major_derog",
    "mths_since_recent_bc_dlq",
    "mths_since_recent_inq",
    "mths_since_recent_revol_delinq",
    "mths_since_recent_revol_delinq",
    "sec_app_mths_since_last_major_derog",
]
joint_cols = [
    "annual_inc_joint", "dti_joint", "verification_status_joint",
    "revol_bal_joint", "sec_app_fico_range_low", "sec_app_fico_range_high",
    "sec_app_earliest_cr_line", "sec_app_inq_last_6mths", "sec_app_mort_acc",
    "sec_app_open_acc", "sec_app_revol_util", "sec_app_open_act_il",
    "sec_app_num_rev_accts", "sec_app_chargeoff_within_12_mths",
    "sec_app_collections_12_mths_ex_med", "sec_app_mths_since_last_major_derog",
]
confusing_cols = [
    "open_acc_6m", "open_act_il", "open_il_12m", "open_il_24m",
    "mths_since_rcnt_il", "total_bal_il", "il_util", "open_rv_12m",
    "open_rv_24m", "max_bal_bc", "all_util", "inq_fi", "total_cu_tl",
    "inq_last_12m",
]

I’ll first look at those more confusing columns to find out whether or not they’re a newer set of metrics. That’ll require converting issue_d to date format first.
loans["issue_d"] = loans["issue_d"].astype("datetime64[ns]")

# Check date range of confusing columns
loans[confusing_cols + ["issue_d"]].dropna(axis="index")["issue_d"].agg(
    ["count", "min", "max"]
)

count                 464325
min      2015-12-01 00:00:00
max      2018-12-01 00:00:00
Name: issue_d, dtype: object

# Compare to all entries from Dec 2015 onward
loans["issue_d"][loans["issue_d"] >= np.datetime64("2015-12-01")].agg(
    ["count", "min", "max"]
)

count                 557708
min      2015-12-01 00:00:00
max      2018-12-01 00:00:00
Name: issue_d, dtype: object

It appears that these are indeed newer metrics, their use only beginning in December 2015, but even after that point usage is spotty. I’m curious to see if these additional metrics would make a model more accurate, though, so once I’m done cleaning the data I’ll copy the rows with these new metrics into a new dataset and create another model using the new metrics.

new_metric_cols = confusing_cols

As for the derogatory/delinquency metrics, taking a cue from Michael Wurm, I’m going to take the inverse of all the “months since recent/last” fields, which will turn each into a proxy for the frequency of the event and also let me set all the null values (when an event has never happened) to 0. For the “months since oldest” fields, I’ll just set the null values to 0 and leave the rest untouched.

mths_since_last_cols = [
    col_name
    for col_name in loans.columns
    if "mths_since" in col_name or "mo_sin_rcnt" in col_name
]
mths_since_old_cols = [
    col_name for col_name in loans.columns if "mo_sin_old" in col_name
]

for col_name in mths_since_last_cols:
    loans[col_name] = [
        0.0 if pd.isna(months) else 1 if months == 0 else 1 / months
        for months in loans[col_name]
    ]
loans.loc[:, mths_since_old_cols].fillna(0, inplace=True)

# Rename inverse columns
rename_mapper = {}
for col_name in mths_since_last_cols:
    rename_mapper[col_name] = col_name.replace("mths_since", "inv_mths_since").replace(
        "mo_sin_rcnt", "inv_mo_sin_rcnt"
    )
loans.rename(columns=rename_mapper, inplace=True)


def replace_list_value(l, old_value, new_value):
    i = l.index(old_value)
    l.pop(i)
    l.insert(i, new_value)


replace_list_value(new_metric_cols, "mths_since_rcnt_il", "inv_mths_since_rcnt_il")
replace_list_value(
    joint_cols,
    "sec_app_mths_since_last_major_derog",
    "sec_app_inv_mths_since_last_major_derog",
)
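To make that transformation concrete, here is how the same list-comprehension logic treats a few illustrative values: an event that never happened, one that happened this month, and progressively older ones.

# Illustrative values only: NaN (never happened) -> 0, 0 months ago -> 1,
# otherwise 1/months, so larger values mean a more recent event.
for months in [np.nan, 0, 1, 3, 24]:
    print(months, "->", 0.0 if pd.isna(months) else 1 if months == 0 else 1 / months)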
Now to look closer at joint loans.

loans.groupby("application_type")["application_type"].count()

application_type
Individual    1322259
Joint App       25800
Name: application_type, dtype: int64

joint_loans = loans[:][loans["application_type"] == "Joint App"]
joint_loans[joint_cols].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25800 entries, 2 to 2260663
Data columns (total 16 columns):
 #   Column                                   Non-Null Count  Dtype
---  ------                                   --------------  -----
 0   annual_inc_joint                         25800 non-null  float64
 1   dti_joint                                25797 non-null  float64
 2   verification_status_joint                25595 non-null  object
 3   revol_bal_joint                          18629 non-null  float64
 4   sec_app_fico_range_low                   18630 non-null  float64
 5   sec_app_fico_range_high                  18630 non-null  float64
 6   sec_app_earliest_cr_line                 18630 non-null  object
 7   sec_app_inq_last_6mths                   18630 non-null  float64
 8   sec_app_mort_acc                         18630 non-null  float64
 9   sec_app_open_acc                         18630 non-null  float64
 10  sec_app_revol_util                       18302 non-null  float64
 11  sec_app_open_act_il                      18630 non-null  float64
 12  sec_app_num_rev_accts                    18630 non-null  float64
 13  sec_app_chargeoff_within_12_mths         18630 non-null  float64
 14  sec_app_collections_12_mths_ex_med      18630 non-null  float64
 15  sec_app_inv_mths_since_last_major_derog  25800 non-null  float64
dtypes: float64(14), object(2)
memory usage: 3.3+ MB

It seems there may be a case of newer metrics for joint applications as well. I’ll investigate.

joint_new_metric_cols = [
    "revol_bal_joint", "sec_app_fico_range_low", "sec_app_fico_range_high",
    "sec_app_earliest_cr_line", "sec_app_inq_last_6mths", "sec_app_mort_acc",
    "sec_app_open_acc", "sec_app_revol_util", "sec_app_open_act_il",
    "sec_app_num_rev_accts", "sec_app_chargeoff_within_12_mths",
    "sec_app_collections_12_mths_ex_med",
    "sec_app_inv_mths_since_last_major_derog",
]
joint_loans[joint_new_metric_cols + ["issue_d"]].dropna(axis="index")["issue_d"].agg(
    ["count", "min", "max"]
)

count                  18301
min      2017-03-01 00:00:00
max      2018-12-01 00:00:00
Name: issue_d, dtype: object

# Check without `sec_app_revol_util` column
joint_new_metric_cols_2 = [
    "revol_bal_joint", "sec_app_fico_range_low", "sec_app_fico_range_high",
    "sec_app_earliest_cr_line", "sec_app_inq_last_6mths", "sec_app_mort_acc",
    "sec_app_open_acc", "sec_app_open_act_il", "sec_app_num_rev_accts",
    "sec_app_chargeoff_within_12_mths", "sec_app_collections_12_mths_ex_med",
    "sec_app_inv_mths_since_last_major_derog",
]
joint_loans[joint_new_metric_cols_2 + ["issue_d"]].dropna(axis="index")["issue_d"].agg(
    ["count", "min", "max"]
)

count                  18629
min      2017-03-01 00:00:00
max      2018-12-01 00:00:00
Name: issue_d, dtype: object

Newer than the previous set of new metrics, even—these didn’t start getting used till March 2017. Now I wonder when joint loans were first introduced.

joint_loans["issue_d"].agg(["count", "min", "max"])

count                  25800
min      2015-10-01 00:00:00
max      2018-12-01 00:00:00
Name: issue_d, dtype: object

October 2015, it turns out. I think I’ll save the newer joint metrics for perhaps a third model, but I believe I can include annual_inc_joint, dti_joint, and verification_status_joint in the main model—I’ll just binary-encode application_type, and for individual applications I’ll set annual_inc_joint, dti_joint, and verification_status_joint equal to their non-joint counterparts.
# Fill joint columns in individual applications
onehot_cols.append("application_type")
for joint_col, indiv_col in zip(
    ["annual_inc_joint", "dti_joint", "verification_status_joint"],
    ["annual_inc", "dti", "verification_status"],
):
    loans[joint_col] = [
        joint_val if app_type == "Joint App" else indiv_val
        for app_type, joint_val, indiv_val in zip(
            loans["application_type"], loans[joint_col], loans[indiv_col]
        )
    ]

loans.info(verbose=True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1348059 entries, 0 to 2260697
Data columns (total 97 columns)
[Output abridged: annual_inc_joint, dti_joint, and verification_status_joint are now essentially complete (1,348,055, 1,348,056, and 1,347,854 non-null), and the inverse "inv_" columns are complete at 1,348,059. The remaining null patterns are unchanged: the 14 new-metric columns sit around 537,598 non-null values and the sec_app_* columns around 18,630.]
dtypes: datetime64[ns](1), float64(86), object(10)
memory usage: 1007.9+ MB

Now the only remaining steps should be removing rows with null values (in columns that aren’t new metrics) and encoding categorical features. I’m removing rows with null values in those columns because that should still leave the vast majority of rows intact, over 1 million, which is still plenty of data. But I guess I should make sure before I overwrite loans.

cols_to_search = [
    col for col in loans.columns if col not in new_metric_cols + joint_new_metric_cols
]
loans.dropna(axis="index", subset=cols_to_search).shape

(1110171, 97)

Yes, still 1,110,171 rows. That’ll do.

loans.dropna(axis="index", subset=cols_to_search, inplace=True)

Actually, I’ll tackle earliest_cr_line and its joint counterpart first before looking at the categorical features.

loans[["earliest_cr_line", "sec_app_earliest_cr_line"]]

         earliest_cr_line sec_app_earliest_cr_line
0                Aug-2003                      NaN
1                Dec-1999                      NaN
2                Aug-2000                      NaN
4                Jun-1998                      NaN
5                Oct-1987                      NaN
...                   ...                      ...
2260688          Jul-2004                      NaN
2260690          Mar-2002                      NaN
2260691          Jun-2011                      NaN
2260692          Aug-1997                      NaN
2260697          Jul-1999                      NaN

[1110171 rows x 2 columns]

I should convert that to the age of the credit line at the time of application (or the time of loan issuing, more precisely).
def get_credit_history_age(col_name):
    earliest_cr_line_date = loans[col_name].astype("datetime64[ns]")
    cr_hist_age_delta = loans["issue_d"] - earliest_cr_line_date
    MINUTES_PER_MONTH = int(365.25 / 12 * 24 * 60)
    cr_hist_age_months = cr_hist_age_delta / np.timedelta64(MINUTES_PER_MONTH, "m")
    return cr_hist_age_months.map(
        lambda value: np.nan if pd.isna(value) else round(value)
    )


cr_hist_age_months = get_credit_history_age("earliest_cr_line")
cr_hist_age_months

0          148
1          192
2          184
4          210
5          338
          ...
2260688    147
2260690    175
2260691     64
2260692    230
2260697    207
Length: 1110171, dtype: int64

loans["earliest_cr_line"] = cr_hist_age_months
loans["sec_app_earliest_cr_line"] = get_credit_history_age(
    "sec_app_earliest_cr_line"
).astype("Int64")
loans.rename(
    columns={
        "earliest_cr_line": "cr_hist_age_mths",
        "sec_app_earliest_cr_line": "sec_app_cr_hist_age_mths",
    },
    inplace=True,
)
replace_list_value(
    joint_new_metric_cols, "sec_app_earliest_cr_line", "sec_app_cr_hist_age_mths"
)

Now for a look at those categorical features.

categorical_cols = [
    "term",
    "emp_length",
    "home_ownership",
    "verification_status",
    "purpose",
    "verification_status_joint",
]
for i, col_name in enumerate(categorical_cols):
    print(
        loans.groupby(col_name)[col_name].count(),
        "\n" if i < len(categorical_cols) - 1 else "",
    )

term
36 months    831601
60 months    278570
Name: term, dtype: int64

emp_length
1 year        76868
10+ years    392883
2 years      106124
3 years       93784
4 years       69031
5 years       72421
6 years       54240
7 years       52229
8 years       53826
9 years       45210
< 1 year      93555
Name: emp_length, dtype: int64

home_ownership
ANY            250
MORTGAGE    559035
NONE            39
OTHER           40
OWN         114577
RENT        436230
Name: home_ownership, dtype: int64

verification_status
Not Verified       335350
Source Verified    463153
Verified           311668
Name: verification_status, dtype: int64

purpose
car                    10754
credit_card           245942
debt_consolidation    653222
educational                1
home_improvement       71089
house                   5720
major_purchase         22901
medical                12302
moving                  7464
other                  60986
renewable_energy         691
small_business         11137
vacation                7169
wedding                  793
Name: purpose, dtype: int64

verification_status_joint
Not Verified       341073
Source Verified    461941
Verified           307157
Name: verification_status_joint, dtype: int64

First, in researching income verification, I learned that LendingClub only tries to verify income on a subset of loan applications based on the content of the application, so this feature is a source of target leakage. I’ll remove the two offending columns (and a couple more I don’t need anymore).

loans.drop(
    columns=[
        "verification_status",
        "verification_status_joint",
        "issue_d",
        "loan_status",
    ],
    inplace=True,
)

Once I create my pipeline, I’ll binary encode term, one-hot encode home_ownership and purpose, and since emp_length is an ordinal variable, I’ll convert it to the integers 0–10.

onehot_cols += ["home_ownership", "purpose"]
ordinal_cols = {
    "emp_length": [
        "< 1 year",
        "1 year",
        "2 years",
        "3 years",
        "4 years",
        "5 years",
        "6 years",
        "7 years",
        "8 years",
        "9 years",
        "10+ years",
    ]
}
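Before wiring those encoders into the pipeline, here’s a tiny standalone sketch of what they’ll do, using a toy two-row frame rather than the real columns:

# Toy illustration of the encoding plan (not part of the modeling pipeline itself).
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

toy = pd.DataFrame(
    {"term": ["36 months", "60 months"], "emp_length": ["< 1 year", "10+ years"]}
)
# drop="if_binary" collapses the two-category `term` into a single 0/1 column
print(OneHotEncoder(drop="if_binary").fit_transform(toy[["term"]]).toarray())
# `emp_length` maps onto the integers 0-10 in the order listed in ordinal_cols
print(
    OrdinalEncoder(categories=[ordinal_cols["emp_length"]]).fit_transform(
        toy[["emp_length"]]
    )
)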
That should cover all the cleaning necessary for the first model’s data. I’ll save the columns that’ll be used in the first model to a new DataFrame, and while I’m at it, I’ll start formatting the DataFrames for the two additional models that add the two sets of new metrics.

loans_1 = loans.drop(columns=new_metric_cols + joint_new_metric_cols)
loans_2 = loans.drop(columns=joint_new_metric_cols)
loans_2.info(verbose=True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1110171 entries, 0 to 2260697
Data columns (total 80 columns)
[Output abridged: every column has 1,110,171 non-null values except the 14 new-metric columns, which have 459,541—apart from il_util, which has only 408,722.]
dtypes: float64(74), int64(1), object(5)
memory usage: 686.1+ MB

Before I drop a bunch of rows with nulls from loans_2, I’m concerned about il_util, as it’s missing values in about 50,000 more rows than the rest of the new metric columns. Why would that be?

loans_2["il_util"][loans_2["il_util"].notna()].describe()

count    408722.000000
mean         71.832894
std          22.311439
min           0.000000
25%          59.000000
50%          75.000000
75%          87.000000
max         464.000000
Name: il_util, dtype: float64

Peeking back up to the data dictionary, il_util is the “ratio of total current balance to high credit/credit limit on all install acct”. The relevant balance (total_bal_il) and credit limit (total_il_high_credit_limit) metrics appear to already be in the data, so perhaps this utilization metric doesn’t contain any new information. I’ll compare il_util (where it’s present) to the ratio of the other two variables.

query_df = loans[["il_util", "total_bal_il", "total_il_high_credit_limit"]].dropna(
    axis="index", subset=["il_util"]
)
query_df["il_util_compute"] = (
    query_df["total_bal_il"] / query_df["total_il_high_credit_limit"]
).map(lambda x: float(round(x * 100)))
query_df[["il_util", "il_util_compute"]]

         il_util  il_util_compute
0           36.0             36.0
1           73.0             73.0
2           73.0             73.0
4           84.0             84.0
5           99.0             99.0
...          ...              ...
2260688     52.0             39.0
2260690     50.0             74.0
2260691     47.0             47.0
2260692     79.0             79.0
2260697     78.0             76.0

[408722 rows x 2 columns]

(query_df["il_util"] == query_df["il_util_compute"]).describe()

count     408722
unique         2
top         True
freq      307589
dtype: object

query_df["compute_diff"] = abs(query_df["il_util"] - query_df["il_util_compute"])
query_df["compute_diff"][query_df["compute_diff"] != 0].describe()

count    101133.000000
mean         14.638684
std          16.409913
min           1.000000
25%           3.000000
50%          10.000000
75%          21.000000
max        1108.000000
Name: compute_diff, dtype: float64

That’s weird. il_util is equal to the computed ratio three-quarters of the time, but when it’s off, the median difference is 10 points. Perhaps there’s new information there sometimes after all. Maybe whatever credit bureau is reporting the utilization rate uses a different formula than just a simple ratio?

Again, this is something I could ask if I were performing this analysis for a client, but that’s not the case. I’ll assume that this variable is still valuable, and where il_util is null I’ll impute the value to make it equal to the ratio of total_bal_il to total_il_high_credit_limit (or 0 if the limit is 0). And I’ll add one more boolean field to mark the imputed entries.

Also, that 1,108 is a doozy of an outlier, but I think I’ll just leave it be, as it appears that outliers aren’t too big a deal if the neural network architecture is sufficiently deep.
outliers aren’t too big a deal loans[ ] = [ pd.isna(util) & pd.notna(bal) & pd.notna(limit) util, bal, limit zip( loans[ ], loans[ ], loans[ ] ) ] new_metric_onehot_cols = [ ] loans[ ] = [ pd.isna(util) & pd.notna(bal) & (limit == ) float(round(bal / limit * )) pd.isna(util) & pd.notna(bal) & pd.notna(limit) util util, bal, limit zip( loans[ ], loans[ ], loans[ ] ) ] loans_2 = loans.drop(columns=joint_new_metric_cols) loans_2.info(verbose= , null_counts= ) "il_util_imputed" True if else False for in "il_util" "total_bal_il" "total_il_high_credit_limit" "il_util_imputed" "il_util" 0.0 if 0 else 100 if else for in "il_util" "total_bal_il" "total_il_high_credit_limit" True True < > Int64Index: entries, Data (total ): # Non- Count Dtype loan_amnt non- float64 term non- emp_length non- home_ownership non- annual_inc non- float64 purpose non- dti non- float64 delinq_2yrs non- float64 cr_hist_age_mths non- int64 fico_range_low non- float64 fico_range_high non- float64 inq_last_6mths non- float64 inv_mths_since_last_delinq non- float64 inv_mths_since_last_record non- float64 open_acc non- float64 pub_rec non- float64 revol_bal non- float64 revol_util non- float64 total_acc non- float64 collections_12_mths_ex_med non- float64 inv_mths_since_last_major_derog non- float64 application_type non- annual_inc_joint non- float64 dti_joint non- float64 acc_now_delinq non- float64 tot_coll_amt non- float64 tot_cur_bal non- float64 open_acc_6m non- float64 open_act_il non- float64 open_il_12m non- float64 open_il_24m non- float64 inv_mths_since_rcnt_il non- float64 total_bal_il non- float64 il_util non- float64 open_rv_12m non- float64 open_rv_24m non- float64 max_bal_bc non- float64 all_util non- float64 total_rev_hi_lim non- float64 inq_fi non- float64 total_cu_tl non- float64 inq_last_12m non- float64 acc_open_past_24mths non- float64 avg_cur_bal non- float64 bc_open_to_buy non- float64 bc_util non- float64 chargeoff_within_12_mths non- float64 delinq_amnt non- float64 mo_sin_old_il_acct non- float64 mo_sin_old_rev_tl_op non- float64 inv_mo_sin_rcnt_rev_tl_op non- float64 inv_mo_sin_rcnt_tl non- float64 mort_acc non- float64 inv_mths_since_recent_bc non- float64 inv_mths_since_recent_bc_dlq non- float64 inv_mths_since_recent_inq non- float64 inv_mths_since_recent_revol_delinq non- float64 num_accts_ever_120_pd non- float64 num_actv_bc_tl non- float64 num_actv_rev_tl non- float64 num_bc_sats non- float64 num_bc_tl non- float64 num_il_tl non- float64 num_op_rev_tl non- float64 num_rev_accts non- float64 num_rev_tl_bal_gt_0 non- float64 num_sats non- float64 num_tl_120dpd_2m non- float64 num_tl_30dpd non- float64 num_tl_90g_dpd_24m non- float64 num_tl_op_past_12m non- float64 pct_tl_nvr_dlq non- float64 percent_bc_gt_75 non- float64 pub_rec_bankruptcies non- float64 tax_liens non- float64 tot_hi_cred_lim non- float64 total_bal_ex_mort non- float64 total_bc_limit non- float64 total_il_high_credit_limit non- float64 fraction_recovered non- float64 il_util_imputed non- dtypes: ( ), float64( ), int64( ), ( ) memory : + MB class 'pandas.core.frame.DataFrame' 1110171 0 to 2260697 columns 81 columns Column Null --- ------ -------------- ----- 0 1110171 null 1 1110171 null object 2 1110171 null object 3 1110171 null object 4 1110171 null 5 1110171 null object 6 1110171 null 7 1110171 null 8 1110171 null 9 1110171 null 10 1110171 null 11 1110171 null 12 1110171 null 13 1110171 null 14 1110171 null 15 1110171 null 16 1110171 null 17 1110171 null 18 1110171 null 19 1110171 null 20 1110171 null 21 1110171 null object 22 
1110171 null 23 1110171 null 24 1110171 null 25 1110171 null 26 1110171 null 27 459541 null 28 459541 null 29 459541 null 30 459541 null 31 1110171 null 32 459541 null 33 459541 null 34 459541 null 35 459541 null 36 459541 null 37 459541 null 38 1110171 null 39 459541 null 40 459541 null 41 459541 null 42 1110171 null 43 1110171 null 44 1110171 null 45 1110171 null 46 1110171 null 47 1110171 null 48 1110171 null 49 1110171 null 50 1110171 null 51 1110171 null 52 1110171 null 53 1110171 null 54 1110171 null 55 1110171 null 56 1110171 null 57 1110171 null 58 1110171 null 59 1110171 null 60 1110171 null 61 1110171 null 62 1110171 null 63 1110171 null 64 1110171 null 65 1110171 null 66 1110171 null 67 1110171 null 68 1110171 null 69 1110171 null 70 1110171 null 71 1110171 null 72 1110171 null 73 1110171 null 74 1110171 null 75 1110171 null 76 1110171 null 77 1110171 null 78 1110171 null 79 1110171 null 80 1110171 null bool bool 1 74 1 object 5 usage 687.1 Good. Ready to drop rows with nulls in and move on to the DataFrame for the model that adds the new metrics for joint applications. loans_2 loans_2.dropna(axis= , inplace= ) loans_3 = loans.dropna(axis= ) loans_3.info(verbose= , null_counts= ) "index" True "index" True True < > Int64Index: entries, Data (total ): # Non- Count Dtype loan_amnt non- float64 term non- emp_length non- home_ownership non- annual_inc non- float64 purpose non- dti non- float64 delinq_2yrs non- float64 cr_hist_age_mths non- int64 fico_range_low non- float64 fico_range_high non- float64 inq_last_6mths non- float64 inv_mths_since_last_delinq non- float64 inv_mths_since_last_record non- float64 open_acc non- float64 pub_rec non- float64 revol_bal non- float64 revol_util non- float64 total_acc non- float64 collections_12_mths_ex_med non- float64 inv_mths_since_last_major_derog non- float64 application_type non- annual_inc_joint non- float64 dti_joint non- float64 acc_now_delinq non- float64 tot_coll_amt non- float64 tot_cur_bal non- float64 open_acc_6m non- float64 open_act_il non- float64 open_il_12m non- float64 open_il_24m non- float64 inv_mths_since_rcnt_il non- float64 total_bal_il non- float64 il_util non- float64 open_rv_12m non- float64 open_rv_24m non- float64 max_bal_bc non- float64 all_util non- float64 total_rev_hi_lim non- float64 inq_fi non- float64 total_cu_tl non- float64 inq_last_12m non- float64 acc_open_past_24mths non- float64 avg_cur_bal non- float64 bc_open_to_buy non- float64 bc_util non- float64 chargeoff_within_12_mths non- float64 delinq_amnt non- float64 mo_sin_old_il_acct non- float64 mo_sin_old_rev_tl_op non- float64 inv_mo_sin_rcnt_rev_tl_op non- float64 inv_mo_sin_rcnt_tl non- float64 mort_acc non- float64 inv_mths_since_recent_bc non- float64 inv_mths_since_recent_bc_dlq non- float64 inv_mths_since_recent_inq non- float64 inv_mths_since_recent_revol_delinq non- float64 num_accts_ever_120_pd non- float64 num_actv_bc_tl non- float64 num_actv_rev_tl non- float64 num_bc_sats non- float64 num_bc_tl non- float64 num_il_tl non- float64 num_op_rev_tl non- float64 num_rev_accts non- float64 num_rev_tl_bal_gt_0 non- float64 num_sats non- float64 num_tl_120dpd_2m non- float64 num_tl_30dpd non- float64 num_tl_90g_dpd_24m non- float64 num_tl_op_past_12m non- float64 pct_tl_nvr_dlq non- float64 percent_bc_gt_75 non- float64 pub_rec_bankruptcies non- float64 tax_liens non- float64 tot_hi_cred_lim non- float64 total_bal_ex_mort non- float64 total_bc_limit non- float64 total_il_high_credit_limit non- float64 revol_bal_joint non- float64 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 14453 entries, 421222 to 2157147
Data columns (total 94 columns):
 #   Column                                   Non-Null Count  Dtype
---  ------                                   --------------  -----
 0   loan_amnt                                14453 non-null  float64
 1   term                                     14453 non-null  object
 2   emp_length                               14453 non-null  object
 3   home_ownership                           14453 non-null  object
 4   annual_inc                               14453 non-null  float64
 5   purpose                                  14453 non-null  object
 6   dti                                      14453 non-null  float64
 7   delinq_2yrs                              14453 non-null  float64
 8   cr_hist_age_mths                         14453 non-null  int64
 9   fico_range_low                           14453 non-null  float64
 10  fico_range_high                          14453 non-null  float64
 11  inq_last_6mths                           14453 non-null  float64
 12  inv_mths_since_last_delinq               14453 non-null  float64
 13  inv_mths_since_last_record               14453 non-null  float64
 14  open_acc                                 14453 non-null  float64
 15  pub_rec                                  14453 non-null  float64
 16  revol_bal                                14453 non-null  float64
 17  revol_util                               14453 non-null  float64
 18  total_acc                                14453 non-null  float64
 19  collections_12_mths_ex_med               14453 non-null  float64
 20  inv_mths_since_last_major_derog          14453 non-null  float64
 21  application_type                         14453 non-null  object
 22  annual_inc_joint                         14453 non-null  float64
 23  dti_joint                                14453 non-null  float64
 24  acc_now_delinq                           14453 non-null  float64
 25  tot_coll_amt                             14453 non-null  float64
 26  tot_cur_bal                              14453 non-null  float64
 27  open_acc_6m                              14453 non-null  float64
 28  open_act_il                              14453 non-null  float64
 29  open_il_12m                              14453 non-null  float64
 30  open_il_24m                              14453 non-null  float64
 31  inv_mths_since_rcnt_il                   14453 non-null  float64
 32  total_bal_il                             14453 non-null  float64
 33  il_util                                  14453 non-null  float64
 34  open_rv_12m                              14453 non-null  float64
 35  open_rv_24m                              14453 non-null  float64
 36  max_bal_bc                               14453 non-null  float64
 37  all_util                                 14453 non-null  float64
 38  total_rev_hi_lim                         14453 non-null  float64
 39  inq_fi                                   14453 non-null  float64
 40  total_cu_tl                              14453 non-null  float64
 41  inq_last_12m                             14453 non-null  float64
 42  acc_open_past_24mths                     14453 non-null  float64
 43  avg_cur_bal                              14453 non-null  float64
 44  bc_open_to_buy                           14453 non-null  float64
 45  bc_util                                  14453 non-null  float64
 46  chargeoff_within_12_mths                 14453 non-null  float64
 47  delinq_amnt                              14453 non-null  float64
 48  mo_sin_old_il_acct                       14453 non-null  float64
 49  mo_sin_old_rev_tl_op                     14453 non-null  float64
 50  inv_mo_sin_rcnt_rev_tl_op                14453 non-null  float64
 51  inv_mo_sin_rcnt_tl                       14453 non-null  float64
 52  mort_acc                                 14453 non-null  float64
 53  inv_mths_since_recent_bc                 14453 non-null  float64
 54  inv_mths_since_recent_bc_dlq             14453 non-null  float64
 55  inv_mths_since_recent_inq                14453 non-null  float64
 56  inv_mths_since_recent_revol_delinq       14453 non-null  float64
 57  num_accts_ever_120_pd                    14453 non-null  float64
 58  num_actv_bc_tl                           14453 non-null  float64
 59  num_actv_rev_tl                          14453 non-null  float64
 60  num_bc_sats                              14453 non-null  float64
 61  num_bc_tl                                14453 non-null  float64
 62  num_il_tl                                14453 non-null  float64
 63  num_op_rev_tl                            14453 non-null  float64
 64  num_rev_accts                            14453 non-null  float64
 65  num_rev_tl_bal_gt_0                      14453 non-null  float64
 66  num_sats                                 14453 non-null  float64
 67  num_tl_120dpd_2m                         14453 non-null  float64
 68  num_tl_30dpd                             14453 non-null  float64
 69  num_tl_90g_dpd_24m                       14453 non-null  float64
 70  num_tl_op_past_12m                       14453 non-null  float64
 71  pct_tl_nvr_dlq                           14453 non-null  float64
 72  percent_bc_gt_75                         14453 non-null  float64
 73  pub_rec_bankruptcies                     14453 non-null  float64
 74  tax_liens                                14453 non-null  float64
 75  tot_hi_cred_lim                          14453 non-null  float64
 76  total_bal_ex_mort                        14453 non-null  float64
 77  total_bc_limit                           14453 non-null  float64
 78  total_il_high_credit_limit               14453 non-null  float64
 79  revol_bal_joint                          14453 non-null  float64
 80  sec_app_fico_range_low                   14453 non-null  float64
 81  sec_app_fico_range_high                  14453 non-null  float64
 82  sec_app_cr_hist_age_mths                 14453 non-null  Int64
 83  sec_app_inq_last_6mths                   14453 non-null  float64
 84  sec_app_mort_acc                         14453 non-null  float64
 85  sec_app_open_acc                         14453 non-null  float64
 86  sec_app_revol_util                       14453 non-null  float64
 87  sec_app_open_act_il                      14453 non-null  float64
 88  sec_app_num_rev_accts                    14453 non-null  float64
 89  sec_app_chargeoff_within_12_mths         14453 non-null  float64
 90  sec_app_collections_12_mths_ex_med       14453 non-null  float64
 91  sec_app_inv_mths_since_last_major_derog  14453 non-null  float64
 92  fraction_recovered                       14453 non-null  float64
 93  il_util_imputed                          14453 non-null  bool
dtypes: Int64(1), bool(1), float64(86), int64(1), object(5)
memory usage: 10.4+ MB

Phew, the data's all clean now! Time for the fun part.

Building the neural networks

After a good deal of trial and error, I settled on a network architecture with three hidden layers, each followed by a dropout layer with a rate of 0.3. The hidden layers use ReLU activation, and the model as a whole uses adam optimization and mean squared error as its loss metric. I tried using mean absolute error at first, but the resulting model would essentially always guess either 1 or 0 for the output (and the majority of the dataset's outputs are 1). Larger errors needed to be penalized to a greater degree, which is exactly what mean squared error does. And since the dataset is so large, I got great results from increasing the batch size for the first couple of models.
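As a quick aside, here's a toy illustration of why mean absolute error rewards that always-guess-1 behavior (the numbers below are made up for illustration, not taken from the dataset): when most targets are 1, the best constant prediction under MAE is the median (exactly 1), while under MSE it's the mean, which gets pulled down by the losses.

import numpy as np

# Toy targets: nine loans fully recovered, one recovered at only 20%
y_true = np.array([1.0] * 9 + [0.2])

const_one = np.ones_like(y_true)                  # always guess 1 (the median)
const_mean = np.full_like(y_true, y_true.mean())  # always guess the mean (0.92)

def mae(pred):
    return np.mean(np.abs(y_true - pred))

def mse(pred):
    return np.mean((y_true - pred) ** 2)

print(mae(const_one), mae(const_mean))  # 0.08 vs. 0.144 -- MAE prefers always guessing 1
print(mse(const_one), mse(const_mean))  # 0.064 vs. 0.0576 -- MSE prefers hedging toward the mean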
from sklearn.model_selection import train_test_split
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from tensorflow.keras import Sequential, Input
from tensorflow.keras.layers import Dense, Dropout


def run_pipeline(data, onehot_cols, ordinal_cols, batch_size, validate=True):
    X = data.drop(columns=["fraction_recovered"])
    y = data["fraction_recovered"]
    X_train, X_valid, y_train, y_valid = (
        train_test_split(X, y, test_size=0.2, random_state=0)
        if validate
        else (X, None, y, None)
    )

    # One-hot encode categoricals, ordinal-encode ordered categories, scale the rest
    transformer = DataFrameMapper(
        [
            (onehot_cols, OneHotEncoder(drop="if_binary")),
            (
                list(ordinal_cols.keys()),
                OrdinalEncoder(categories=list(ordinal_cols.values())),
            ),
        ],
        default=StandardScaler(),
    )
    X_train = transformer.fit_transform(X_train)
    X_valid = transformer.transform(X_valid) if validate else None

    # Three hidden layers with ReLU activation, each followed by dropout at rate 0.3
    input_nodes = X_train.shape[1]
    output_nodes = 1
    model = Sequential()
    model.add(Input((input_nodes,)))
    model.add(Dense(64, activation="relu"))
    model.add(Dropout(0.3, seed=0))
    model.add(Dense(32, activation="relu"))
    model.add(Dropout(0.3, seed=1))
    model.add(Dense(16, activation="relu"))
    model.add(Dropout(0.3, seed=2))
    model.add(Dense(output_nodes))
    model.compile(optimizer="adam", loss="mean_squared_error")

    history = model.fit(
        X_train,
        y_train,
        batch_size=batch_size,
        epochs=100,
        validation_data=(X_valid, y_valid) if validate else None,
        verbose=2,
    )

    return history.history, model, transformer


print("Model 1:")
history_1, _, _ = run_pipeline(loans_1, onehot_cols, ordinal_cols, batch_size=128)
print("\nModel 2:")
history_2, _, _ = run_pipeline(
    loans_2, onehot_cols + new_metric_onehot_cols, ordinal_cols, batch_size=64
)
print("\nModel 3:")
history_3, _, _ = run_pipeline(
    loans_3, onehot_cols + new_metric_onehot_cols, ordinal_cols, batch_size=32
)

Model 1: Epoch 1/100 6939/6939 - 13s - loss: 0.0848 - val_loss: 0.0603 Epoch 2/100 6939/6939 - 13s - loss: 0.0598 - val_loss: 0.0593 Epoch 3/100 6939/6939 - 13s - loss: 0.0594 - val_loss: 0.0589 Epoch 4/100 6939/6939 - 13s - loss: 0.0592 - val_loss: 0.0588 Epoch 5/100 6939/6939 - 13s - loss: 0.0591 - val_loss: 0.0591 Epoch 6/100 6939/6939 - 13s - loss: 0.0590 - val_loss: 0.0585 Epoch 7/100 6939/6939 - 13s - loss: 0.0590 - val_loss: 0.0589 Epoch 8/100 6939/6939 - 14s - loss: 0.0590 - val_loss: 0.0586 Epoch 9/100 6939/6939 - 15s - loss: 0.0590 - val_loss: 0.0586 Epoch 10/100 6939/6939 - 13s - loss: 0.0589 - val_loss: 0.0585 Epoch 11/100 6939/6939 - 13s - loss: 0.0589 - val_loss: 0.0584 Epoch 12/100 6939/6939 - 13s - loss: 0.0588 - val_loss: 0.0584 Epoch 13/100 6939/6939 - 13s - loss: 0.0588 - val_loss: 0.0584 Epoch 14/100 6939/6939 - 13s - loss: 0.0588 - val_loss: 0.0592 Epoch 15/100 6939/6939 - 13s - loss: 0.0587 - val_loss: 0.0585 Epoch 16/100 6939/6939 - 13s - loss: 0.0587 - val_loss: 0.0583 Epoch 17/100 6939/6939 - 13s - loss: 0.0587 - val_loss: 0.0582 Epoch 18/100 6939/6939 - 13s - loss: 0.0587 - val_loss: 0.0583 Epoch 19/100 6939/6939 - 13s - loss: 0.0587 - val_loss: 0.0586 Epoch 20/100 6939/6939 - 14s - loss: 0.0587 - val_loss: 0.0584 Epoch 21/100 6939/6939 - 14s - loss: 0.0587 - val_loss: 0.0585 Epoch 22/100 6939/6939 - 14s - loss: 0.0586 - val_loss: 0.0584 Epoch 23/100 6939/6939 - 13s - loss: 0.0586 - val_loss: 0.0583 Epoch 24/100 6939/6939 - 13s - loss: 0.0586 - val_loss: 0.0583 Epoch 25/100 6939/6939 - 13s - loss: 0.0586 - val_loss: 0.0585 Epoch 26/100 6939/6939 - 13s - loss: 0.0585 - val_loss: 0.0586 Epoch 27/100 6939/6939 -
13s - loss: 0.0586 - val_loss: 0.0582 Epoch 28/100 6939/6939 - 13s - loss: 0.0586 - val_loss: 0.0586 Epoch 29/100 6939/6939 - 13s - loss: 0.0585 - val_loss: 0.0586 Epoch 30/100 6939/6939 - 13s - loss: 0.0585 - val_loss: 0.0583 Epoch 31/100 6939/6939 - 13s - loss: 0.0585 - val_loss: 0.0583 Epoch 32/100 6939/6939 - 13s - loss: 0.0585 - val_loss: 0.0582 Epoch 33/100 6939/6939 - 13s - loss: 0.0585 - val_loss: 0.0583 Epoch 34/100 6939/6939 - 13s - loss: 0.0585 - val_loss: 0.0583 Epoch 35/100 6939/6939 - 13s - loss: 0.0585 - val_loss: 0.0584 Epoch 36/100 6939/6939 - 13s - loss: 0.0584 - val_loss: 0.0583 Epoch 37/100 6939/6939 - 13s - loss: 0.0585 - val_loss: 0.0583 Epoch 38/100 6939/6939 - 14s - loss: 0.0585 - val_loss: 0.0585 Epoch 39/100 6939/6939 - 13s - loss: 0.0585 - val_loss: 0.0582 Epoch 40/100 6939/6939 - 13s - loss: 0.0585 - val_loss: 0.0583 Epoch 41/100 6939/6939 - 13s - loss: 0.0585 - val_loss: 0.0583 Epoch 42/100 6939/6939 - 13s - loss: 0.0585 - val_loss: 0.0585 Epoch 43/100 6939/6939 - 13s - loss: 0.0585 - val_loss: 0.0582 Epoch 44/100 6939/6939 - 13s - loss: 0.0585 - val_loss: 0.0585 Epoch 45/100 6939/6939 - 14s - loss: 0.0585 - val_loss: 0.0583 Epoch 46/100 6939/6939 - 13s - loss: 0.0585 - val_loss: 0.0583 Epoch 47/100 6939/6939 - 13s - loss: 0.0585 - val_loss: 0.0585 Epoch 48/100 6939/6939 - 14s - loss: 0.0585 - val_loss: 0.0581 Epoch 49/100 6939/6939 - 14s - loss: 0.0584 - val_loss: 0.0583 Epoch 50/100 6939/6939 - 13s - loss: 0.0584 - val_loss: 0.0583 Epoch 51/100 6939/6939 - 13s - loss: 0.0585 - val_loss: 0.0587 Epoch 52/100 6939/6939 - 13s - loss: 0.0584 - val_loss: 0.0582 Epoch 53/100 6939/6939 - 13s - loss: 0.0584 - val_loss: 0.0582 Epoch 54/100 6939/6939 - 13s - loss: 0.0585 - val_loss: 0.0584 Epoch 55/100 6939/6939 - 13s - loss: 0.0585 - val_loss: 0.0581 Epoch 56/100 6939/6939 - 13s - loss: 0.0584 - val_loss: 0.0585 Epoch 57/100 6939/6939 - 13s - loss: 0.0584 - val_loss: 0.0581 Epoch 58/100 6939/6939 - 13s - loss: 0.0585 - val_loss: 0.0582 Epoch 59/100 6939/6939 - 13s - loss: 0.0585 - val_loss: 0.0584 Epoch 60/100 6939/6939 - 13s - loss: 0.0584 - val_loss: 0.0582 Epoch 61/100 6939/6939 - 13s - loss: 0.0585 - val_loss: 0.0584 Epoch 62/100 6939/6939 - 13s - loss: 0.0584 - val_loss: 0.0585 Epoch 63/100 6939/6939 - 13s - loss: 0.0584 - val_loss: 0.0583 Epoch 64/100 6939/6939 - 13s - loss: 0.0584 - val_loss: 0.0583 Epoch 65/100 6939/6939 - 13s - loss: 0.0585 - val_loss: 0.0584 Epoch 66/100 6939/6939 - 13s - loss: 0.0585 - val_loss: 0.0582 Epoch 67/100 6939/6939 - 13s - loss: 0.0584 - val_loss: 0.0583 Epoch 68/100 6939/6939 - 13s - loss: 0.0584 - val_loss: 0.0582 Epoch 69/100 6939/6939 - 13s - loss: 0.0585 - val_loss: 0.0586 Epoch 70/100 6939/6939 - 13s - loss: 0.0584 - val_loss: 0.0583 Epoch 71/100 6939/6939 - 13s - loss: 0.0584 - val_loss: 0.0588 Epoch 72/100 6939/6939 - 13s - loss: 0.0584 - val_loss: 0.0584 Epoch 73/100 6939/6939 - 13s - loss: 0.0584 - val_loss: 0.0585 Epoch 74/100 6939/6939 - 13s - loss: 0.0584 - val_loss: 0.0582 Epoch 75/100 6939/6939 - 14s - loss: 0.0584 - val_loss: 0.0583 Epoch 76/100 6939/6939 - 14s - loss: 0.0584 - val_loss: 0.0585 Epoch 77/100 6939/6939 - 13s - loss: 0.0584 - val_loss: 0.0583 Epoch 78/100 6939/6939 - 13s - loss: 0.0584 - val_loss: 0.0583 Epoch 79/100 6939/6939 - 13s - loss: 0.0584 - val_loss: 0.0582 Epoch 80/100 6939/6939 - 13s - loss: 0.0585 - val_loss: 0.0584 Epoch 81/100 6939/6939 - 13s - loss: 0.0585 - val_loss: 0.0583 Epoch 82/100 6939/6939 - 13s - loss: 0.0584 - val_loss: 0.0581 Epoch 83/100 6939/6939 - 13s - loss: 0.0585 - 
val_loss: 0.0582 Epoch 84/100 6939/6939 - 14s - loss: 0.0584 - val_loss: 0.0581 Epoch 85/100 6939/6939 - 13s - loss: 0.0584 - val_loss: 0.0582 Epoch 86/100 6939/6939 - 13s - loss: 0.0584 - val_loss: 0.0592 Epoch 87/100 6939/6939 - 13s - loss: 0.0584 - val_loss: 0.0582 Epoch 88/100 6939/6939 - 13s - loss: 0.0584 - val_loss: 0.0586 Epoch 89/100 6939/6939 - 13s - loss: 0.0585 - val_loss: 0.0583 Epoch 90/100 6939/6939 - 13s - loss: 0.0584 - val_loss: 0.0583 Epoch 91/100 6939/6939 - 13s - loss: 0.0585 - val_loss: 0.0584 Epoch 92/100 6939/6939 - 13s - loss: 0.0584 - val_loss: 0.0583 Epoch 93/100 6939/6939 - 13s - loss: 0.0585 - val_loss: 0.0585 Epoch 94/100 6939/6939 - 13s - loss: 0.0586 - val_loss: 0.0583 Epoch 95/100 6939/6939 - 13s - loss: 0.0586 - val_loss: 0.0583 Epoch 96/100 6939/6939 - 13s - loss: 0.0585 - val_loss: 0.0584 Epoch 97/100 6939/6939 - 13s - loss: 0.0586 - val_loss: 0.0593 Epoch 98/100 6939/6939 - 13s - loss: 0.0585 - val_loss: 0.0585 Epoch 99/100 6939/6939 - 13s - loss: 0.0585 - val_loss: 0.0582 Epoch 100/100 6939/6939 - 13s - loss: 0.0584 - val_loss: 0.0583 Model 2: Epoch 1/100 5745/5745 - 10s - loss: 0.1097 - val_loss: 0.0749 Epoch 2/100 5745/5745 - 10s - loss: 0.0752 - val_loss: 0.0739 Epoch 3/100 5745/5745 - 10s - loss: 0.0745 - val_loss: 0.0729 Epoch 4/100 5745/5745 - 10s - loss: 0.0742 - val_loss: 0.0726 Epoch 5/100 5745/5745 - 10s - loss: 0.0741 - val_loss: 0.0735 Epoch 6/100 5745/5745 - 10s - loss: 0.0739 - val_loss: 0.0733 Epoch 7/100 5745/5745 - 10s - loss: 0.0738 - val_loss: 0.0731 Epoch 8/100 5745/5745 - 10s - loss: 0.0737 - val_loss: 0.0727 Epoch 9/100 5745/5745 - 10s - loss: 0.0737 - val_loss: 0.0725 Epoch 10/100 5745/5745 - 10s - loss: 0.0736 - val_loss: 0.0729 Epoch 11/100 5745/5745 - 10s - loss: 0.0736 - val_loss: 0.0726 Epoch 12/100 5745/5745 - 10s - loss: 0.0735 - val_loss: 0.0732 Epoch 13/100 5745/5745 - 10s - loss: 0.0735 - val_loss: 0.0727 Epoch 14/100 5745/5745 - 10s - loss: 0.0735 - val_loss: 0.0726 Epoch 15/100 5745/5745 - 10s - loss: 0.0734 - val_loss: 0.0729 Epoch 16/100 5745/5745 - 10s - loss: 0.0734 - val_loss: 0.0727 Epoch 17/100 5745/5745 - 10s - loss: 0.0734 - val_loss: 0.0733 Epoch 18/100 5745/5745 - 10s - loss: 0.0733 - val_loss: 0.0733 Epoch 19/100 5745/5745 - 10s - loss: 0.0733 - val_loss: 0.0725 Epoch 20/100 5745/5745 - 10s - loss: 0.0733 - val_loss: 0.0731 Epoch 21/100 5745/5745 - 10s - loss: 0.0733 - val_loss: 0.0723 Epoch 22/100 5745/5745 - 10s - loss: 0.0733 - val_loss: 0.0728 Epoch 23/100 5745/5745 - 10s - loss: 0.0733 - val_loss: 0.0728 Epoch 24/100 5745/5745 - 10s - loss: 0.0732 - val_loss: 0.0724 Epoch 25/100 5745/5745 - 10s - loss: 0.0732 - val_loss: 0.0726 Epoch 26/100 5745/5745 - 10s - loss: 0.0732 - val_loss: 0.0727 Epoch 27/100 5745/5745 - 10s - loss: 0.0732 - val_loss: 0.0724 Epoch 28/100 5745/5745 - 10s - loss: 0.0731 - val_loss: 0.0728 Epoch 29/100 5745/5745 - 10s - loss: 0.0732 - val_loss: 0.0727 Epoch 30/100 5745/5745 - 10s - loss: 0.0731 - val_loss: 0.0725 Epoch 31/100 5745/5745 - 10s - loss: 0.0731 - val_loss: 0.0733 Epoch 32/100 5745/5745 - 10s - loss: 0.0730 - val_loss: 0.0725 Epoch 33/100 5745/5745 - 10s - loss: 0.0731 - val_loss: 0.0728 Epoch 34/100 5745/5745 - 10s - loss: 0.0731 - val_loss: 0.0727 Epoch 35/100 5745/5745 - 10s - loss: 0.0731 - val_loss: 0.0724 Epoch 36/100 5745/5745 - 10s - loss: 0.0731 - val_loss: 0.0727 Epoch 37/100 5745/5745 - 10s - loss: 0.0730 - val_loss: 0.0729 Epoch 38/100 5745/5745 - 10s - loss: 0.0730 - val_loss: 0.0735 Epoch 39/100 5745/5745 - 10s - loss: 0.0730 - val_loss: 0.0728 Epoch 
40/100 5745/5745 - 11s - loss: 0.0730 - val_loss: 0.0722 Epoch 41/100 5745/5745 - 10s - loss: 0.0730 - val_loss: 0.0722 Epoch 42/100 5745/5745 - 10s - loss: 0.0730 - val_loss: 0.0729 Epoch 43/100 5745/5745 - 10s - loss: 0.0730 - val_loss: 0.0724 Epoch 44/100 5745/5745 - 10s - loss: 0.0730 - val_loss: 0.0726 Epoch 45/100 5745/5745 - 10s - loss: 0.0731 - val_loss: 0.0723 Epoch 46/100 5745/5745 - 10s - loss: 0.0730 - val_loss: 0.0727 Epoch 47/100 5745/5745 - 10s - loss: 0.0730 - val_loss: 0.0725 Epoch 48/100 5745/5745 - 10s - loss: 0.0730 - val_loss: 0.0730 Epoch 49/100 5745/5745 - 10s - loss: 0.0730 - val_loss: 0.0723 Epoch 50/100 5745/5745 - 10s - loss: 0.0730 - val_loss: 0.0732 Epoch 51/100 5745/5745 - 10s - loss: 0.0730 - val_loss: 0.0723 Epoch 52/100 5745/5745 - 10s - loss: 0.0729 - val_loss: 0.0723 Epoch 53/100 5745/5745 - 10s - loss: 0.0730 - val_loss: 0.0725 Epoch 54/100 5745/5745 - 10s - loss: 0.0729 - val_loss: 0.0727 Epoch 55/100 5745/5745 - 10s - loss: 0.0730 - val_loss: 0.0728 Epoch 56/100 5745/5745 - 10s - loss: 0.0729 - val_loss: 0.0729 Epoch 57/100 5745/5745 - 10s - loss: 0.0729 - val_loss: 0.0724 Epoch 58/100 5745/5745 - 10s - loss: 0.0729 - val_loss: 0.0723 Epoch 59/100 5745/5745 - 10s - loss: 0.0729 - val_loss: 0.0724 Epoch 60/100 5745/5745 - 10s - loss: 0.0729 - val_loss: 0.0726 Epoch 61/100 5745/5745 - 10s - loss: 0.0729 - val_loss: 0.0728 Epoch 62/100 5745/5745 - 10s - loss: 0.0728 - val_loss: 0.0724 Epoch 63/100 5745/5745 - 10s - loss: 0.0728 - val_loss: 0.0724 Epoch 64/100 5745/5745 - 10s - loss: 0.0728 - val_loss: 0.0728 Epoch 65/100 5745/5745 - 10s - loss: 0.0728 - val_loss: 0.0725 Epoch 66/100 5745/5745 - 10s - loss: 0.0729 - val_loss: 0.0724 Epoch 67/100 5745/5745 - 10s - loss: 0.0728 - val_loss: 0.0723 Epoch 68/100 5745/5745 - 10s - loss: 0.0728 - val_loss: 0.0725 Epoch 69/100 5745/5745 - 10s - loss: 0.0728 - val_loss: 0.0725 Epoch 70/100 5745/5745 - 10s - loss: 0.0728 - val_loss: 0.0725 Epoch 71/100 5745/5745 - 10s - loss: 0.0727 - val_loss: 0.0724 Epoch 72/100 5745/5745 - 10s - loss: 0.0728 - val_loss: 0.0726 Epoch 73/100 5745/5745 - 10s - loss: 0.0727 - val_loss: 0.0723 Epoch 74/100 5745/5745 - 10s - loss: 0.0727 - val_loss: 0.0728 Epoch 75/100 5745/5745 - 10s - loss: 0.0728 - val_loss: 0.0727 Epoch 76/100 5745/5745 - 10s - loss: 0.0727 - val_loss: 0.0726 Epoch 77/100 5745/5745 - 10s - loss: 0.0728 - val_loss: 0.0723 Epoch 78/100 5745/5745 - 10s - loss: 0.0727 - val_loss: 0.0726 Epoch 79/100 5745/5745 - 10s - loss: 0.0727 - val_loss: 0.0725 Epoch 80/100 5745/5745 - 10s - loss: 0.0727 - val_loss: 0.0725 Epoch 81/100 5745/5745 - 10s - loss: 0.0727 - val_loss: 0.0725 Epoch 82/100 5745/5745 - 10s - loss: 0.0726 - val_loss: 0.0726 Epoch 83/100 5745/5745 - 10s - loss: 0.0727 - val_loss: 0.0727 Epoch 84/100 5745/5745 - 10s - loss: 0.0727 - val_loss: 0.0725 Epoch 85/100 5745/5745 - 10s - loss: 0.0727 - val_loss: 0.0727 Epoch 86/100 5745/5745 - 10s - loss: 0.0727 - val_loss: 0.0725 Epoch 87/100 5745/5745 - 10s - loss: 0.0727 - val_loss: 0.0726 Epoch 88/100 5745/5745 - 10s - loss: 0.0726 - val_loss: 0.0726 Epoch 89/100 5745/5745 - 10s - loss: 0.0727 - val_loss: 0.0723 Epoch 90/100 5745/5745 - 10s - loss: 0.0726 - val_loss: 0.0728 Epoch 91/100 5745/5745 - 10s - loss: 0.0727 - val_loss: 0.0725 Epoch 92/100 5745/5745 - 10s - loss: 0.0727 - val_loss: 0.0726 Epoch 93/100 5745/5745 - 10s - loss: 0.0727 - val_loss: 0.0726 Epoch 94/100 5745/5745 - 10s - loss: 0.0727 - val_loss: 0.0727 Epoch 95/100 5745/5745 - 10s - loss: 0.0726 - val_loss: 0.0726 Epoch 96/100 5745/5745 - 10s - 
loss: 0.0727 - val_loss: 0.0726 Epoch 97/100 5745/5745 - 10s - loss: 0.0726 - val_loss: 0.0724 Epoch 98/100 5745/5745 - 10s - loss: 0.0726 - val_loss: 0.0726 Epoch 99/100 5745/5745 - 9s - loss: 0.0726 - val_loss: 0.0728 Epoch 100/100 5745/5745 - 10s - loss: 0.0726 - val_loss: 0.0723 Model 3: Epoch 1/100 362/362 - 1s - loss: 0.4367 - val_loss: 0.1752 Epoch 2/100 362/362 - 1s - loss: 0.1696 - val_loss: 0.1288 Epoch 3/100 362/362 - 1s - loss: 0.1409 - val_loss: 0.1218 Epoch 4/100 362/362 - 1s - loss: 0.1251 - val_loss: 0.1188 Epoch 5/100 362/362 - 1s - loss: 0.1218 - val_loss: 0.1163 Epoch 6/100 362/362 - 1s - loss: 0.1157 - val_loss: 0.1144 Epoch 7/100 362/362 - 1s - loss: 0.1122 - val_loss: 0.1131 Epoch 8/100 362/362 - 1s - loss: 0.1108 - val_loss: 0.1121 Epoch 9/100 362/362 - 1s - loss: 0.1092 - val_loss: 0.1117 Epoch 10/100 362/362 - 1s - loss: 0.1073 - val_loss: 0.1118 Epoch 11/100 362/362 - 1s - loss: 0.1063 - val_loss: 0.1112 Epoch 12/100 362/362 - 1s - loss: 0.1045 - val_loss: 0.1108 Epoch 13/100 362/362 - 1s - loss: 0.1041 - val_loss: 0.1101 Epoch 14/100 362/362 - 1s - loss: 0.1033 - val_loss: 0.1092 Epoch 15/100 362/362 - 1s - loss: 0.1027 - val_loss: 0.1096 Epoch 16/100 362/362 - 1s - loss: 0.1022 - val_loss: 0.1090 Epoch 17/100 362/362 - 1s - loss: 0.1017 - val_loss: 0.1121 Epoch 18/100 362/362 - 1s - loss: 0.1000 - val_loss: 0.1100 Epoch 19/100 362/362 - 1s - loss: 0.0997 - val_loss: 0.1118 Epoch 20/100 362/362 - 1s - loss: 0.0997 - val_loss: 0.1117 Epoch 21/100 362/362 - 1s - loss: 0.0992 - val_loss: 0.1163 Epoch 22/100 362/362 - 1s - loss: 0.0986 - val_loss: 0.1150 Epoch 23/100 362/362 - 1s - loss: 0.0986 - val_loss: 0.1139 Epoch 24/100 362/362 - 1s - loss: 0.0983 - val_loss: 0.1100 Epoch 25/100 362/362 - 1s - loss: 0.0979 - val_loss: 0.1125 Epoch 26/100 362/362 - 1s - loss: 0.0969 - val_loss: 0.1117 Epoch 27/100 362/362 - 1s - loss: 0.0960 - val_loss: 0.1098 Epoch 28/100 362/362 - 1s - loss: 0.0957 - val_loss: 0.1124 Epoch 29/100 362/362 - 1s - loss: 0.0956 - val_loss: 0.1133 Epoch 30/100 362/362 - 1s - loss: 0.0950 - val_loss: 0.1080 Epoch 31/100 362/362 - 1s - loss: 0.0953 - val_loss: 0.1086 Epoch 32/100 362/362 - 1s - loss: 0.0944 - val_loss: 0.1091 Epoch 33/100 362/362 - 1s - loss: 0.0952 - val_loss: 0.1096 Epoch 34/100 362/362 - 1s - loss: 0.0932 - val_loss: 0.1091 Epoch 35/100 362/362 - 1s - loss: 0.0919 - val_loss: 0.1087 Epoch 36/100 362/362 - 1s - loss: 0.0921 - val_loss: 0.1123 Epoch 37/100 362/362 - 1s - loss: 0.0927 - val_loss: 0.1110 Epoch 38/100 362/362 - 1s - loss: 0.0920 - val_loss: 0.1111 Epoch 39/100 362/362 - 1s - loss: 0.0909 - val_loss: 0.1108 Epoch 40/100 362/362 - 1s - loss: 0.0914 - val_loss: 0.1140 Epoch 41/100 362/362 - 1s - loss: 0.0902 - val_loss: 0.1124 Epoch 42/100 362/362 - 1s - loss: 0.0895 - val_loss: 0.1116 Epoch 43/100 362/362 - 1s - loss: 0.0908 - val_loss: 0.1113 Epoch 44/100 362/362 - 1s - loss: 0.0885 - val_loss: 0.1126 Epoch 45/100 362/362 - 1s - loss: 0.0869 - val_loss: 0.1137 Epoch 46/100 362/362 - 1s - loss: 0.0883 - val_loss: 0.1127 Epoch 47/100 362/362 - 1s - loss: 0.0884 - val_loss: 0.1112 Epoch 48/100 362/362 - 1s - loss: 0.0880 - val_loss: 0.1115 Epoch 49/100 362/362 - 1s - loss: 0.0886 - val_loss: 0.1106 Epoch 50/100 362/362 - 1s - loss: 0.0863 - val_loss: 0.1152 Epoch 51/100 362/362 - 1s - loss: 0.0864 - val_loss: 0.1136 Epoch 52/100 362/362 - 1s - loss: 0.0861 - val_loss: 0.1130 Epoch 53/100 362/362 - 1s - loss: 0.0851 - val_loss: 0.1171 Epoch 54/100 362/362 - 1s - loss: 0.0850 - val_loss: 0.1150 Epoch 55/100 362/362 - 1s - 
loss: 0.0850 - val_loss: 0.1139 Epoch 56/100 362/362 - 1s - loss: 0.0851 - val_loss: 0.1125 Epoch 57/100 362/362 - 1s - loss: 0.0853 - val_loss: 0.1132 Epoch 58/100 362/362 - 1s - loss: 0.0827 - val_loss: 0.1129 Epoch 59/100 362/362 - 1s - loss: 0.0838 - val_loss: 0.1145 Epoch 60/100 362/362 - 1s - loss: 0.0822 - val_loss: 0.1222 Epoch 61/100 362/362 - 1s - loss: 0.0811 - val_loss: 0.1152 Epoch 62/100 362/362 - 1s - loss: 0.0831 - val_loss: 0.1127 Epoch 63/100 362/362 - 1s - loss: 0.0805 - val_loss: 0.1184 Epoch 64/100 362/362 - 1s - loss: 0.0820 - val_loss: 0.1149 Epoch 65/100 362/362 - 1s - loss: 0.0812 - val_loss: 0.1138 Epoch 66/100 362/362 - 1s - loss: 0.0799 - val_loss: 0.1149 Epoch 67/100 362/362 - 1s - loss: 0.0816 - val_loss: 0.1132 Epoch 68/100 362/362 - 1s - loss: 0.0806 - val_loss: 0.1179 Epoch 69/100 362/362 - 1s - loss: 0.0789 - val_loss: 0.1172 Epoch 70/100 362/362 - 1s - loss: 0.0791 - val_loss: 0.1131 Epoch 71/100 362/362 - 1s - loss: 0.0786 - val_loss: 0.1140 Epoch 72/100 362/362 - 1s - loss: 0.0799 - val_loss: 0.1165 Epoch 73/100 362/362 - 1s - loss: 0.0795 - val_loss: 0.1137 Epoch 74/100 362/362 - 1s - loss: 0.0785 - val_loss: 0.1169 Epoch 75/100 362/362 - 1s - loss: 0.0778 - val_loss: 0.1159 Epoch 76/100 362/362 - 1s - loss: 0.0785 - val_loss: 0.1144 Epoch 77/100 362/362 - 1s - loss: 0.0767 - val_loss: 0.1141 Epoch 78/100 362/362 - 1s - loss: 0.0756 - val_loss: 0.1132 Epoch 79/100 362/362 - 1s - loss: 0.0769 - val_loss: 0.1171 Epoch 80/100 362/362 - 1s - loss: 0.0782 - val_loss: 0.1159 Epoch 81/100 362/362 - 1s - loss: 0.0770 - val_loss: 0.1153 Epoch 82/100 362/362 - 1s - loss: 0.0761 - val_loss: 0.1139 Epoch 83/100 362/362 - 1s - loss: 0.0770 - val_loss: 0.1164 Epoch 84/100 362/362 - 1s - loss: 0.0768 - val_loss: 0.1146 Epoch 85/100 362/362 - 1s - loss: 0.0759 - val_loss: 0.1182 Epoch 86/100 362/362 - 1s - loss: 0.0754 - val_loss: 0.1180 Epoch 87/100 362/362 - 1s - loss: 0.0760 - val_loss: 0.1164 Epoch 88/100 362/362 - 1s - loss: 0.0755 - val_loss: 0.1176 Epoch 89/100 362/362 - 1s - loss: 0.0739 - val_loss: 0.1167 Epoch 90/100 362/362 - 1s - loss: 0.0751 - val_loss: 0.1163 Epoch 91/100 362/362 - 1s - loss: 0.0740 - val_loss: 0.1192 Epoch 92/100 362/362 - 1s - loss: 0.0745 - val_loss: 0.1296 Epoch 93/100 362/362 - 1s - loss: 0.0745 - val_loss: 0.1165 Epoch 94/100 362/362 - 1s - loss: 0.0743 - val_loss: 0.1192 Epoch 95/100 362/362 - 1s - loss: 0.0734 - val_loss: 0.1166 Epoch 96/100 362/362 - 1s - loss: 0.0743 - val_loss: 0.1150 Epoch 97/100 362/362 - 1s - loss: 0.0728 - val_loss: 0.1182 Epoch 98/100 362/362 - 1s - loss: 0.0737 - val_loss: 0.1185 Epoch 99/100 362/362 - 1s - loss: 0.0729 - val_loss: 0.1185 Epoch 100/100 362/362 - 1s - loss: 0.0731 - val_loss: 0.1200

The first model performed best, settling around a mean squared error of 0.0583 (though it seems that even after setting random_state inside train_test_split and seed inside the dropout layers, there's still a bit of entropy left in the training of the model, so if you run this notebook yourself, the course of your training may look a little different).

Apparently the additional records in the first dataset did more to aid in training than the additional metrics in the subsequent sets. And the dropout layers didn't stop the third model from overfitting anyway.
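If you wanted to chase down more of that leftover randomness, you could also pin the global seeds before building the model. This isn't in the original notebook, and some GPU-level nondeterminism can still slip through, but a sketch would look something like this:

import random

import numpy as np
import tensorflow as tf

def set_global_seeds(seed=0):
    # Pin Python's, NumPy's, and TensorFlow's random number generators
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

set_global_seeds(0)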
sns.lineplot(x=range(1, 101), y=history_1["loss"], label="loss")
sns.lineplot(x=range(1, 101), y=history_1["val_loss"], label="val_loss")
plt.xlabel("epoch")
plt.title("Model 1 loss metrics during training")
plt.show()

Saving the final model

First I need to create the final model, training model_1's architecture on the full dataset. Then I'll save the model to disk with its save function and save the data transformer using joblib so I can use it in the API.

import joblib

_, final_model, final_transformer = run_pipeline(
    loans_1,
    onehot_cols,
    ordinal_cols,
    batch_size=128,
    validate=False,
)

final_model.save("loan_risk_model")
joblib.dump(final_transformer, "data_transformer.joblib")

Epoch 1/100 8674/8674 - 14s - loss: 0.0804 Epoch 2/100 8674/8674 - 14s - loss: 0.0598 Epoch 3/100 8674/8674 - 14s - loss: 0.0594 Epoch 4/100 8674/8674 - 15s - loss: 0.0593 Epoch 5/100 8674/8674 - 14s - loss: 0.0592 Epoch 6/100 8674/8674 - 15s - loss: 0.0591 Epoch 7/100 8674/8674 - 14s - loss: 0.0591 Epoch 8/100 8674/8674 - 14s - loss: 0.0591 Epoch 9/100 8674/8674 - 14s - loss: 0.0590 Epoch 10/100 8674/8674 - 14s - loss: 0.0591 Epoch 11/100 8674/8674 - 14s - loss: 0.0590 Epoch 12/100 8674/8674 - 14s - loss: 0.0590 Epoch 13/100 8674/8674 - 14s - loss: 0.0589 Epoch 14/100 8674/8674 - 14s - loss: 0.0590 Epoch 15/100 8674/8674 - 14s - loss: 0.0590 Epoch 16/100 8674/8674 - 14s - loss: 0.0589 Epoch 17/100 8674/8674 - 14s - loss: 0.0589 Epoch 18/100 8674/8674 - 14s - loss: 0.0589 Epoch 19/100 8674/8674 - 14s - loss: 0.0589 Epoch 20/100 8674/8674 - 14s - loss: 0.0589 Epoch 21/100 8674/8674 - 14s - loss: 0.0589 Epoch 22/100 8674/8674 - 14s - loss: 0.0589 Epoch 23/100 8674/8674 - 14s - loss: 0.0589 Epoch 24/100 8674/8674 - 15s - loss: 0.0589 Epoch 25/100 8674/8674 - 14s - loss: 0.0589 Epoch 26/100 8674/8674 - 14s - loss: 0.0589 Epoch 27/100 8674/8674 - 14s - loss: 0.0589 Epoch 28/100 8674/8674 - 14s - loss: 0.0589 Epoch 29/100 8674/8674 - 14s - loss: 0.0589 Epoch 30/100 8674/8674 - 14s - loss: 0.0589 Epoch 31/100 8674/8674 - 14s - loss: 0.0589 Epoch 32/100 8674/8674 - 14s - loss: 0.0589 Epoch 33/100 8674/8674 - 14s - loss: 0.0589 Epoch 34/100 8674/8674 - 15s - loss: 0.0589 Epoch 35/100 8674/8674 - 14s - loss: 0.0589 Epoch 36/100 8674/8674 - 14s - loss: 0.0589 Epoch 37/100 8674/8674 - 14s - loss: 0.0589 Epoch 38/100 8674/8674 - 14s - loss: 0.0589 Epoch 39/100 8674/8674 - 14s - loss: 0.0588 Epoch 40/100 8674/8674 - 16s - loss: 0.0588 Epoch 41/100 8674/8674 - 14s - loss: 0.0589 Epoch 42/100 8674/8674 - 14s - loss: 0.0588 Epoch 43/100 8674/8674 - 14s - loss: 0.0589 Epoch 44/100 8674/8674 - 14s - loss: 0.0588 Epoch 45/100 8674/8674 - 14s - loss: 0.0588 Epoch 46/100 8674/8674 - 14s - loss: 0.0588 Epoch 47/100 8674/8674 - 15s - loss: 0.0589 Epoch 48/100 8674/8674 - 15s - loss: 0.0589 Epoch 49/100 8674/8674 - 15s - loss: 0.0588 Epoch 50/100 8674/8674 - 14s - loss: 0.0589 Epoch 51/100 8674/8674 - 14s - loss: 0.0588 Epoch 52/100 8674/8674 - 14s - loss: 0.0589 Epoch 53/100 8674/8674 - 14s - loss: 0.0588 Epoch 54/100 8674/8674 - 14s - loss: 0.0588 Epoch 55/100 8674/8674 - 14s - loss: 0.0589 Epoch 56/100 8674/8674 - 14s - loss: 0.0588 Epoch 57/100 8674/8674 - 14s - loss: 0.0588 Epoch 58/100 8674/8674 - 14s - loss: 0.0588 Epoch 59/100 8674/8674 - 14s - loss: 0.0588 Epoch 60/100 8674/8674 - 14s - loss: 0.0589 Epoch 61/100 8674/8674 - 14s - loss: 0.0588 Epoch 62/100 8674/8674 - 14s - loss: 0.0588 Epoch 63/100 8674/8674 - 14s - loss: 0.0588 Epoch 64/100 8674/8674 - 14s - loss: 0.0589 Epoch 65/100
8674/8674 - 14s - loss: 0.0588 Epoch 66/100 8674/8674 - 14s - loss: 0.0588 Epoch 67/100 8674/8674 - 14s - loss: 0.0588 Epoch 68/100 8674/8674 - 14s - loss: 0.0588 Epoch 69/100 8674/8674 - 14s - loss: 0.0589 Epoch 70/100 8674/8674 - 14s - loss: 0.0588 Epoch 71/100 8674/8674 - 15s - loss: 0.0588 Epoch 72/100 8674/8674 - 15s - loss: 0.0588 Epoch 73/100 8674/8674 - 15s - loss: 0.0588 Epoch 74/100 8674/8674 - 14s - loss: 0.0588 Epoch 75/100 8674/8674 - 14s - loss: 0.0588 Epoch 76/100 8674/8674 - 14s - loss: 0.0588 Epoch 77/100 8674/8674 - 14s - loss: 0.0588 Epoch 78/100 8674/8674 - 14s - loss: 0.0589 Epoch 79/100 8674/8674 - 14s - loss: 0.0588 Epoch 80/100 8674/8674 - 14s - loss: 0.0589 Epoch 81/100 8674/8674 - 14s - loss: 0.0589 Epoch 82/100 8674/8674 - 14s - loss: 0.0589 Epoch 83/100 8674/8674 - 14s - loss: 0.0589 Epoch 84/100 8674/8674 - 14s - loss: 0.0589 Epoch 85/100 8674/8674 - 14s - loss: 0.0588 Epoch 86/100 8674/8674 - 14s - loss: 0.0588 Epoch 87/100 8674/8674 - 14s - loss: 0.0588 Epoch 88/100 8674/8674 - 14s - loss: 0.0588 Epoch 89/100 8674/8674 - 14s - loss: 0.0588 Epoch 90/100 8674/8674 - 14s - loss: 0.0588 Epoch 91/100 8674/8674 - 14s - loss: 0.0588 Epoch 92/100 8674/8674 - 14s - loss: 0.0588 Epoch 93/100 8674/8674 - 14s - loss: 0.0588 Epoch 94/100 8674/8674 - 14s - loss: 0.0588 Epoch 95/100 8674/8674 - 14s - loss: 0.0588 Epoch 96/100 8674/8674 - 14s - loss: 0.0588 Epoch 97/100 8674/8674 - 15s - loss: 0.0587 Epoch 98/100 8674/8674 - 14s - loss: 0.0588 Epoch 99/100 8674/8674 - 15s - loss: 0.0588 Epoch 100/100 8674/8674 - 14s - loss: 0.0588

['data_transformer.joblib']

Building the API

I first tried building this API and its demonstrational front end on Glitch, which officially only supports Node.js back ends, but unofficially you can get a Python server running there (which I've done before using Flask). When I was almost finished, though, I tried importing TensorFlow to load my model, and it was then that I discovered that unlike Node.js dependencies, Python dependencies get installed to your project's disk space on Glitch, and not even their pro plan provides enough space to contain the entire TensorFlow library. Which totally makes sense; I certainly wasn't using the platform as intended.

Then I discovered PythonAnywhere! They have plenty of common Python libraries already installed out of the box, including TensorFlow, so I got everything working perfectly there.

So head on over if you'd like to check it out; the front end includes a form where you can fill in all the parameters for the API request, and there are a couple of buttons that let you fill the form with typical examples from the dataset (since there are a lot of fields to fill in). Or you can send a GET request to https://tywmick.pythonanywhere.com/api/predict if you really want to include every parameter in your query string. In either case, you're also more than welcome to take a look at its source on GitHub.

Evaluating performance

A month and a half after I first published this, I've decided I should evaluate whether or not this predictive model is actually useful. Tipping my hat once again to Michael Wurm for the idea, I figure comparing my model's performance to a selection method based on the loan grade assigned by LendingClub ought to be sufficient. I'll save a version of loans_1 to disk, adding a few original columns back for evaluation, so I can do this in a new notebook (this article's long enough already).
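In that new notebook, loading the saved artifacts back should just mirror the save calls above. Here's a minimal sketch, assuming loan_risk_model and data_transformer.joblib are in the working directory (new_loans is a hypothetical DataFrame of applications with the model's input columns):

import joblib
from tensorflow.keras.models import load_model

# Reload the trained Keras model and the fitted DataFrameMapper
final_model = load_model("loan_risk_model")
final_transformer = joblib.load("data_transformer.joblib")

# Preprocess new applications the same way the training data was preprocessed,
# then predict fraction_recovered
# predictions = final_model.predict(final_transformer.transform(new_loans))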
expected.rename("expected_return", inplace=True)
loans_for_eval = loans_1.join([loans_raw[["grade", "sub_grade"]], expected])
loans_for_eval.head()

│   │ loan_amnt │ term      │ emp_length │ home_ownership │ annual_inc │ purpose            │ dti   │ delinq_2yrs │ cr_hist_age_mths │ fico_range_low │ ... │ pub_rec_bankruptcies │ tax_liens │ tot_hi_cred_lim │ total_bal_ex_mort │ total_bc_limit │ total_il_high_credit_limit │ fraction_recovered │ grade │ sub_grade │ expected_return │
│ 0 │ 3600.0    │ 36 months │ 10+ years  │ MORTGAGE       │ 55000.0    │ debt_consolidation │ 5.91  │ 0.0         │ 148.0            │ 675.0          │ ... │ 0.0                  │ 0.0       │ 178050.0        │ 7746.0            │ 2400.0         │ 13734.0                    │ 1.0                │ C     │ C4        │ 4429.08         │
│ 1 │ 24700.0   │ 36 months │ 10+ years  │ MORTGAGE       │ 65000.0    │ small_business     │ 16.06 │ 1.0         │ 192.0            │ 715.0          │ ... │ 0.0                  │ 0.0       │ 314017.0        │ 39475.0           │ 79300.0        │ 24667.0                    │ 1.0                │ C     │ C1        │ 29530.08        │
│ 2 │ 20000.0   │ 60 months │ 10+ years  │ MORTGAGE       │ 63000.0    │ home_improvement   │ 10.78 │ 0.0         │ 184.0            │ 695.0          │ ... │ 0.0                  │ 0.0       │ 218418.0        │ 18696.0           │ 6200.0         │ 14877.0                    │ 1.0                │ B     │ B4        │ 25959.60        │
│ 4 │ 10400.0   │ 60 months │ 3 years    │ MORTGAGE       │ 104433.0   │ major_purchase     │ 25.37 │ 1.0         │ 210.0            │ 695.0          │ ... │ 0.0                  │ 0.0       │ 439570.0        │ 95768.0           │ 20300.0        │ 88097.0                    │ 1.0                │ F     │ F1        │ 17394.60        │
│ 5 │ 11950.0   │ 36 months │ 4 years    │ RENT           │ 34000.0    │ debt_consolidation │ 10.20 │ 0.0         │ 338.0            │ 690.0          │ ... │ 0.0                  │ 0.0       │ 16900.0         │ 12798.0           │ 9400.0         │ 4000.0                     │ 1.0                │ C     │ C3        │ 14586.48        │

5 rows × 69 columns

joblib.dump(loans_for_eval, "loans_for_eval.joblib")

['loans_for_eval.joblib']

Oh, and I should save my neural network pipeline, too.

joblib.dump(run_pipeline, "pipeline.joblib")

['pipeline.joblib']

That'll do it. Stay tuned for the sequel!

One of the best/worst things about machine learning is that your models always have room for improvement. I mentioned a couple of ideas along the way above for how I could improve the model in the future, but what's the first thing you would tweak in this model? I'd love to hear in the comments below.

Previously published at https://tymick.me/blog/loan-risk-neural-network