LUP Student Papers

LUND UNIVERSITY LIBRARIES

Corporate default prediction: a comparison between Merton model and random forest in an environment of data scarcity

Díaz García, Aitor LU and Mirosnikovs, Matiss LU (2022) NEKN02 20221
Department of Economics
Abstract
The aim of this paper is to compare the performance of the Merton model with that of a machine learning technique (random forest) in a context where the number of predictors is low or the dataset is small. Since random forest is a data-intensive method, the main goal is to find the minimum number of explanatory variables and observations needed for it to perform at least as well as the Merton model, an approach developed in the 1970s that gives the probability of a firm defaulting. Results suggest that a minimum of 13 predictors is required for the two models to perform similarly, and that the dataset should contain no fewer than 9,600 observations for random forest to be as accurate as the classic Merton approach, provided that the proportion of defaults in the dataset is around 0.18%. Our study also suggests that applying the over-sampling technique SMOTE combined with 100% under-sampling of the majority class leads to superior results for the random forest, with predictive performance above 0.8 as measured by the area under the curve (AUC). In general, our findings indicate that, although machine learning techniques perform better in absolute terms when predicting corporate defaults, it is important to consider the amount and quality of the available data before applying any of them, since in extreme data-scarcity scenarios traditional approaches such as the Merton model perform better.
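As a rough illustration of what "the probability of the firm defaulting" means in a Merton-type model, the sketch below uses the standard textbook form, in which default occurs when the asset value falls below the face value of debt at the horizon. It is not taken from the thesis itself: the function names, parameter names, and the example firm's figures are all assumptions made for illustration.

```python
from math import erf, log, sqrt


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def merton_default_probability(
    asset_value: float,
    debt_face_value: float,
    asset_drift: float,
    asset_volatility: float,
    horizon_years: float = 1.0,
) -> float:
    """Textbook Merton default probability, PD = N(-d2).

    Assumes asset value follows geometric Brownian motion and that default
    happens only if assets are below the debt face value at the horizon.
    All parameter names here are hypothetical, chosen for readability.
    """
    if min(asset_value, debt_face_value, asset_volatility, horizon_years) <= 0:
        raise ValueError("asset value, debt, volatility and horizon must be positive")

    # Distance to default: log-leverage plus expected growth, in volatility units.
    d2 = (
        log(asset_value / debt_face_value)
        + (asset_drift - 0.5 * asset_volatility**2) * horizon_years
    ) / (asset_volatility * sqrt(horizon_years))
    return normal_cdf(-d2)


if __name__ == "__main__":
    # Hypothetical firm: assets 120, debt 100, 5% drift, 25% volatility, 1 year.
    pd_estimate = merton_default_probability(120.0, 100.0, 0.05, 0.25)
    print(f"Estimated one-year default probability: {pd_estimate:.4f}")
```

In practice the asset value and asset volatility are not observed directly and have to be inferred, typically from equity prices, which this sketch deliberately leaves out.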
author
Díaz García, Aitor LU and Mirosnikovs, Matiss LU
supervisor
organization
Department of Economics
course
NEKN02 20221
year
2022
type
H1 - Master's Degree (One Year)
subject
keywords
Merton model, random forest, default prediction, SMOTE.
language
English
id
9084730
date added to LUP
2022-10-10 09:33:52
date last changed
2022-10-10 09:33:52
@misc{9084730,
  abstract     = {{The aim of this paper is to compare the performance of the Merton model with that of a machine learning technique (random forest) in a context where the number of predictors is low or the dataset is small. Since random forest is a data-intensive method, the main goal is to find the minimum number of explanatory variables and observations needed for it to perform at least as well as the Merton model, an approach developed in the 1970s that gives the probability of a firm defaulting. Results suggest that a minimum of 13 predictors is required for the two models to perform similarly, and that the dataset should contain no fewer than 9,600 observations for random forest to be as accurate as the classic Merton approach, provided that the proportion of defaults in the dataset is around 0.18%. Our study also suggests that applying the over-sampling technique SMOTE combined with 100% under-sampling of the majority class leads to superior results for the random forest, with predictive performance above 0.8 as measured by the area under the curve (AUC). In general, our findings indicate that, although machine learning techniques perform better in absolute terms when predicting corporate defaults, it is important to consider the amount and quality of the available data before applying any of them, since in extreme data-scarcity scenarios traditional approaches such as the Merton model perform better.}},
  author       = {{Díaz García, Aitor and Mirosnikovs, Matiss}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Corporate default prediction: a comparison between Merton model and random forest in an environment of data scarcity}},
  year         = {{2022}},
}