Addressing Data Imbalance in Bankruptcy Prediction: Comparative Literature Insights

Authors:

(1) Harrison Mateika, Northwestern University;

(2) Juannan Jia, Northwestern University;

(3) Linda Lillard, Northwestern University;

(4) Noah Cronbaugh, Northwestern University;

(5) Will Shin, Northwestern University.

2. Literature Review

Data scientists have long sought to predict corporate bankruptcy from financial ratios using statistical models. One of the most famous papers in this area is Edward Altman’s 1968 paper, “Financial Ratios, Discriminant Analysis and the Prediction of Corporate Bankruptcy”. In it, Altman applied multiple discriminant analysis to 22 financial ratios of 66 companies, half of which had filed for bankruptcy, and found that only 5 variables were needed to predict bankruptcy accurately. With this model, Altman predicted bankruptcy very accurately in the first year. However, that accuracy waned over longer horizons.
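To illustrate what a five-variable discriminant model looks like in practice, the sketch below uses the commonly cited form of Altman’s Z-score. The coefficients, the ratio definitions, and the 1.81 and 2.99 cutoffs come from the general literature on the Z-score rather than from the text above, and the `Ratios` class and the example firm are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Ratios:
    """Five ratios of the commonly cited Z-score, expressed as decimals."""
    working_capital_to_assets: float     # X1
    retained_earnings_to_assets: float   # X2
    ebit_to_assets: float                # X3
    market_equity_to_liabilities: float  # X4
    sales_to_assets: float               # X5


# Commonly cited (rounded) coefficients; an assumption, not taken from this paper.
WEIGHTS = (1.2, 1.4, 3.3, 0.6, 1.0)


def z_score(r: Ratios) -> float:
    """Linear discriminant score: weighted sum of the five ratios."""
    values = (
        r.working_capital_to_assets,
        r.retained_earnings_to_assets,
        r.ebit_to_assets,
        r.market_equity_to_liabilities,
        r.sales_to_assets,
    )
    return sum(w * x for w, x in zip(WEIGHTS, values))


def zone(z: float) -> str:
    """Map a score to the commonly cited zones (cutoffs are assumptions)."""
    if z < 1.81:
        return "distress"
    if z > 2.99:
        return "safe"
    return "grey"


# Hypothetical firm, for illustration only.
example_firm = Ratios(0.10, 0.20, 0.05, 0.50, 1.00)
example_z = z_score(example_firm)
print(round(example_z, 3), zone(example_z))
```

The point of the sketch is only that the fitted model reduces to a weighted sum of a handful of ratios compared against a threshold, which is why it was so easy to apply.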

Much of the subsequent work on bankruptcy prediction builds on Altman’s paper. From a machine learning perspective, however, the challenge for many of these papers is that data with many bankruptcy occurrences is hard to find. The main reason is simple: only 3% of public companies file for bankruptcy. With such a large imbalance between the two classes, it is challenging to build a precise model.
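A small sketch shows why this imbalance is a problem. Assume a hypothetical sample of 1,000 firms in which 3% went bankrupt, and a trivial model that always predicts "not bankrupt". The data and function names are illustrative assumptions, not part of any paper discussed here.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn), treating label 1 (bankrupt) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


# Hypothetical sample: 30 bankrupt firms (1) and 970 healthy firms (0).
y_true = [1] * 30 + [0] * 970

# Trivial baseline: predict "not bankrupt" for every firm.
y_pred = [0] * len(y_true)

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
baseline_accuracy = (tp + tn) / len(y_true)
baseline_sensitivity = tp / (tp + fn)

print(f"accuracy={baseline_accuracy:.2f}")        # accuracy=0.97
print(f"sensitivity={baseline_sensitivity:.2f}")  # sensitivity=0.00
```

The baseline scores 97% accuracy while detecting no bankruptcies at all, which is why the papers reviewed here report measures such as sensitivity and AUC rather than accuracy alone.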

Vikram Devatha faced this challenge in his article “Predicting bankruptcy using Machine Learning”. In his experiment, Devatha used several classification algorithms to predict bankruptcy: Logistic Regression, Perceptron as a classifier, Deep Neural Network Classifiers, Fisher Linear Discriminant Analysis, K Nearest Neighbor Classifier, Naive Bayes Classifier, Decision Tree, Bagged Decision Trees, Random Forest, Gradient Boosting, and Support Vector Machine. He also varied the configuration of individual algorithms. For instance, he compared k values from 1 to 19 for K Nearest Neighbor and used tree counts ranging from 50 to 500 for his Random Forest method. Using sensitivity (the true positive rate) as his final measure, he found that Gradient Boosting and Bagged Decision Trees performed best among the algorithms tested. Sensitivity was rather high in this instance, with a true positive rate of 85.05%.
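The kind of sweep described above, varying k for K Nearest Neighbor and scoring each setting by sensitivity, can be sketched in plain Python. This is not Devatha’s code: the synthetic two-feature data, the class sizes, the tie-breaking rule, and all function names are assumptions made for illustration.

```python
import random


def knn_predict(train_X, train_y, x, k):
    """Predict 1 if a strict majority of the k nearest training points are 1."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    votes = sum(train_y[i] for i in order[:k])
    return 1 if votes * 2 > k else 0  # ties (even k) fall back to class 0


def sensitivity(y_true, y_pred):
    """True positive rate: share of actual positives predicted as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


def make_data(rng, n_healthy, n_bankrupt):
    """Synthetic, imbalanced two-feature data (purely illustrative)."""
    X, y = [], []
    for _ in range(n_healthy):
        X.append((rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)))
        y.append(0)
    for _ in range(n_bankrupt):
        X.append((rng.gauss(2.0, 1.0), rng.gauss(2.0, 1.0)))
        y.append(1)
    return X, y


rng = random.Random(0)
train_X, train_y = make_data(rng, 300, 30)
test_X, test_y = make_data(rng, 100, 10)

# Sweep k from 1 to 19 and record sensitivity on the held-out set.
results = {}
for k in range(1, 20):
    preds = [knn_predict(train_X, train_y, x, k) for x in test_X]
    results[k] = sensitivity(test_y, preds)

best_k = max(results, key=results.get)
print(best_k, results[best_k])
```

Selecting on sensitivity alone rewards models that flag many firms as bankrupt, so in practice it is usually read alongside a measure of false positives.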

Kou, Xu, Peng et al. explored the imbalance of bankruptcies when evaluating the models in their paper “Bankruptcy Prediction for SMEs using transactional data and two-stage multiobjective feature selection”. The team compared model results on their original bankruptcy data set with results on undersampled and oversampled versions of that data set. The models performed better on the original imbalanced data set, and sensitivity did not improve when the sampling techniques were applied. The paper also found that the ensemble method XGB had the best AUC of the models compared. Model 5 in the paper incorporated transactional data into bankruptcy prediction, which gives the paper real practical implications.
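For readers unfamiliar with the two sampling techniques, the sketch below shows the simplest random variants of each. It is not the procedure used by Kou, Xu, Peng et al., whose exact resampling method is not described here; the function names and the toy data are assumptions.

```python
import random
from collections import Counter


def _indices_by_class(y):
    """Group row indices by class label."""
    groups = {}
    for i, label in enumerate(y):
        groups.setdefault(label, []).append(i)
    return groups


def random_oversample(X, y, rng):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    groups = _indices_by_class(y)
    target = max(len(ix) for ix in groups.values())
    keep = []
    for ix in groups.values():
        keep.extend(ix)
        keep.extend(rng.choices(ix, k=target - len(ix)))  # sampling with replacement
    return [X[i] for i in keep], [y[i] for i in keep]


def random_undersample(X, y, rng):
    """Drop randomly chosen rows until every class matches the smallest one."""
    groups = _indices_by_class(y)
    target = min(len(ix) for ix in groups.values())
    keep = []
    for ix in groups.values():
        keep.extend(rng.sample(ix, target))  # sampling without replacement
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: 3 bankrupt firms (1) and 97 healthy firms (0), one feature each.
X = [[float(i)] for i in range(100)]
y = [1] * 3 + [0] * 97

rng = random.Random(42)
X_over, y_over = random_oversample(X, y, rng)
X_under, y_under = random_undersample(X, y, rng)

print(Counter(y_over))
print(Counter(y_under))
```

Oversampling keeps every original row but repeats minority rows, while undersampling discards most of the majority class. Resampling is normally applied to the training split only, so that evaluation still reflects the real class balance.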

Liang, Lu, Tsai, and Shih, in their paper “Financial ratios and corporate governance indicators in bankruptcy prediction: A comprehensive study”, examined the same data set that we use in this paper. Their primary aim was to determine whether the corporate governance indicators in this data increase its predictive power. They found that the corporate governance indicators did improve bankruptcy prediction. Of the techniques they used (support vector machines, k-nearest neighbor, naive Bayes classifier, classification and regression tree, and multilayer perceptron), SVMs performed best.

Rather than ranking these papers as best or worst, it is more useful to note that all of them struggle with the same challenge: a large imbalance between companies that went bankrupt and those that did not. Each paper tackles the problem in a different way, and each has its own implications. Altman’s paper is a product of its time; while still highly relevant, it lacks the sophisticated techniques available today for predicting bankruptcy. Furthermore, the paper readily admits that it lacks practical implications. The papers by Kou, Xu, Peng et al. and Liang, Lu, Tsai et al. each have practical implications: for Kou, Xu, Peng et al. these mainly concern transactional data, while for Liang, Lu, Tsai et al. they concern corporate governance data. Because both have real-world implications, we consider them the most rigorous analyses of bankruptcy data among the papers reviewed.

Our contribution to the bankruptcy prediction literature differs from these papers in two ways. First, we oversample the corporate governance data used by Liang, Lu, Tsai et al. to determine whether models trained on oversampled data outperform models trained on the imbalanced data. The main goal is to see whether the results from Kou, Xu, Peng et al. are replicable with other bankruptcy data. Second, we apply rigorous analysis to our methods. While we seek the best model using manual feature selection, modeling, and evaluation methods, we also test whether our methods outperform automated machine learning through AutoML. This comparison gives our paper practical implications either way: either we can outperform automated methods in predicting bankruptcies, or automated machine learning may be the best way of dealing with this type of data.

This paper is available on arXiv under the CC BY 4.0 Deed (Attribution 4.0 International) license.