Corporate default prediction: a comparison between Merton model and random forest in an environment of data scarcity
(2022) NEKN02 20221, Department of Economics
- Abstract
- The aim of this paper is to compare the performance of the Merton model with that of a machine learning technique (random forest) in a context where the number of predictors is low or the dataset is small. Since random forest is a data-intensive method, the main goal is to find the minimum number of explanatory variables and observations needed for it to perform at least as well as the Merton model, an approach developed in the 1970s that gives the probability of a firm defaulting. Results suggest that a minimum of 13 predictors is required for both models to perform similarly, and that the dataset should contain no fewer than 9,600 observations for random forest to be as accurate as the classic Merton approach, provided that the proportion of defaults in the dataset is around 0.18%. Our study also suggests that applying the over-sampling technique SMOTE combined with 100% under-sampling of the majority class leads to superior results for the random forest, with accuracy, as measured by area under the curve (AUC), higher than 0.8. In general, our findings indicate that, although machine learning techniques perform better in absolute terms when predicting corporate defaults, it is important to consider the amount and quality of the available data before applying any of them, since in scenarios of extreme data scarcity traditional approaches such as the Merton model perform better.
Please use this url to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/9084730
- author
- Díaz García, Aitor LU and Mirosnikovs, Matiss LU
- supervisor
- organization
- course
- NEKN02 20221
- year
- 2022
- type
- H1 - Master's Degree (One Year)
- subject
- keywords
- Merton model, random forest, default prediction, SMOTE
- language
- English
- id
- 9084730
- date added to LUP
- 2022-10-10 09:33:52
- date last changed
- 2022-10-10 09:33:52
@misc{9084730,
  abstract = {{The aim of this paper is to compare the performance of the Merton model with that of a machine learning technique (random forest) in a context where the number of predictors is low or the dataset is small. Since random forest is a data-intensive method, the main goal is to find the minimum number of explanatory variables and observations needed for it to perform at least as well as the Merton model, an approach developed in the 1970s that gives the probability of a firm defaulting. Results suggest that a minimum of 13 predictors is required for both models to perform similarly, and that the dataset should contain no fewer than 9,600 observations for random forest to be as accurate as the classic Merton approach, provided that the proportion of defaults in the dataset is around 0.18\%. Our study also suggests that applying the over-sampling technique SMOTE combined with 100\% under-sampling of the majority class leads to superior results for the random forest, with accuracy, as measured by area under the curve (AUC), higher than 0.8. In general, our findings indicate that, although machine learning techniques perform better in absolute terms when predicting corporate defaults, it is important to consider the amount and quality of the available data before applying any of them, since in scenarios of extreme data scarcity traditional approaches such as the Merton model perform better.}},
  author = {{Díaz García, Aitor and Mirosnikovs, Matiss}},
  language = {{eng}},
  note = {{Student Paper}},
  title = {{Corporate default prediction: a comparison between Merton model and random forest in an environment of data scarcity}},
  year = {{2022}},
}