Corporate default prediction: a comparison between Merton model and random forest in an environment of data scarcity

Díaz García, Aitor; Mirosnikovs, Matiss

Díaz García, Aitor and Mirosnikovs, Matiss (2022) NEKN02 20221
Department of Economics

Abstract: This paper compares the performance of the Merton model with that of a machine learning technique (random forest) in a setting where the number of predictors is low or the dataset is small. Since random forest is a data-intensive method, the main goal is to find the minimum number of explanatory variables and observations it needs to perform at least as well as the Merton model, an approach developed in the 1970s that gives the probability of a firm defaulting. Results suggest that a minimum of 13 predictors is required for both models to perform similarly, and that the dataset should contain no fewer than 9,600 observations for random forest to be as accurate as Merton's classic approach, provided that the proportion of defaults in the dataset is around 0.18%. Our study also suggests that applying the over-sampling technique SMOTE combined with 100% under-sampling of the majority class leads to superior results for the random forest, with performance above 0.8 as measured by the area under the curve (AUC).
In general, our findings indicate that, although machine learning techniques perform better in absolute terms when predicting corporate defaults, it is important to consider the amount and quality of the available data before applying any of them, since in extreme data-scarcity scenarios traditional approaches such as the Merton model perform better.

Please use this url to cite or link to this publication: http://lup.lub.lu.se/student-papers/record/9084730

author

Díaz García, Aitor and Mirosnikovs, Matiss

supervisor

Anders Vilhelmsson

organization

Department of Economics

course

NEKN02 20221

year

2022

type

H1 - Master's Degree (One Year)

subject

Business and Economics

keywords

Merton model, random forest, default prediction, SMOTE.

language

English

id

9084730

date added to LUP

2022-10-10 09:33:52

date last changed

2022-10-10 09:33:52

@misc{9084730,
  abstract     = {{This paper compares the performance of the Merton model with that of a machine learning technique (random forest) in a setting where the number of predictors is low or the dataset is small. Since random forest is a data-intensive method, the main goal is to find the minimum number of explanatory variables and observations it needs to perform at least as well as the Merton model, an approach developed in the 1970s that gives the probability of a firm defaulting. Results suggest that a minimum of 13 predictors is required for both models to perform similarly, and that the dataset should contain no fewer than 9,600 observations for random forest to be as accurate as Merton's classic approach, provided that the proportion of defaults in the dataset is around 0.18%. Our study also suggests that applying the over-sampling technique SMOTE combined with 100% under-sampling of the majority class leads to superior results for the random forest, with performance above 0.8 as measured by the area under the curve (AUC). In general, our findings indicate that, although machine learning techniques perform better in absolute terms when predicting corporate defaults, it is important to consider the amount and quality of the available data before applying any of them, since in extreme data-scarcity scenarios traditional approaches such as the Merton model perform better.}},
  author       = {{Díaz García, Aitor and Mirosnikovs, Matiss}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Corporate default prediction: a comparison between Merton model and random forest in an environment of data scarcity}},
  year         = {{2022}},
}

LUP Student Papers

LUND UNIVERSITY LIBRARIES
