Improving High-Risk Consumer Credit Scoring with Financial Transaction Data

Stålhammar, Jon LU and Hvarfner, Carl LU (2020) In LUTFMS-3393-2020 FMSM01 20201
Mathematical Statistics
Abstract
Credit scoring, the process of evaluating loan applicants with the help of numerical methods, is widely used in the financial sector. When used correctly, it can aid the decision process and lower default rates on loans. New EU regulation has allowed third-party actors to access personal banking information, following the applicant's consent. The increase in available information that financial transaction data provides can be used to improve credit scoring models, and can thus be of great value to lenders. Using a data set from a Swedish consumer institute containing raw transaction data, the authors have tried to extract useful features which can improve credit scoring precision. The authors describe transaction categorization, feature construction and selection as well as model evaluation. A large number of features were created, which led to a substantial and involved feature reduction process. Four types of models were evaluated: a Logistic Regression, a Random Forest, an XGBoost and a Neural Network. As a reference, the company's old model, a Logistic Regression using socioeconomic factors, was used. After careful evaluation using Bayesian Optimization for hyperparameter tuning, a combination of Logistic Regression models was considered to be the best and most consistent. Of the transaction-derived features evaluated, bailiff expenses, salary and ATM withdrawals were considered to be the most influential across all models. Compared to an optimized Logistic Regression base model, a four percentage point average precision improvement could be observed in the final model.
Popular Abstract
Even though machine learning models have been around for decades, they have risen quickly in popularity thanks to large amounts of data and faster computer processors. Credit institutes, particularly those that give out consumer loans, are among those wanting to use machine learning in their day-to-day operations to eliminate subjective opinions and boost their profits. Human decisions always inject subjective opinions, which are based on personal experience and are often not true. With automated computer processes, predicting whether a customer will be unable to repay a loan can be done faster and with higher precision. Meanwhile, thanks to new EU regulations, third parties are allowed, with customers' consent, to analyze their financial transactions to produce better services. This thesis connects these two dots, machine learning and financial transactions, to improve what is called credit scoring. Multiple models have been evaluated, and it can be concluded that the more complex models do not outperform a simple statistical model (Logistic Regression). However, this might change as financial transactions are evaluated further and as such data become more widely available.
author
Stålhammar, Jon LU and Hvarfner, Carl LU
supervisor
organization
course
FMSM01 20201
year
2020
type
H2 - Master's Degree (Two Years)
subject
keywords
Machine Learning, Scorecard Modelling, Feature Engineering, Feature Selection, Average Precision Score, SHAP, SMOTE, Logistic Regression, Random Forest, XGBoost, Artificial Neural Network
publication/series
LUTFMS-3393-2020
report number
2020:E53
ISSN
1404-6342
language
English
id
9008737
date added to LUP
2020-10-05 13:09:48
date last changed
2020-10-05 13:09:48
@misc{9008737,
  abstract     = {Credit scoring, the process of evaluating loan applicants with the help of numerical methods, is widely used in the financial sector. When used correctly, it can aid the decision process and lower default rates on loans. New EU regulation has allowed third-party actors to access personal banking information, following the applicant's consent. The increase in available information that financial transaction data provides can be used to improve credit scoring models, and can thus be of great value to lenders. Using a data set from a Swedish consumer institute containing raw transaction data, the authors have tried to extract useful features which can improve credit scoring precision. The authors describe transaction categorization, feature construction and selection as well as model evaluation. A large number of features were created, which led to a substantial and involved feature reduction process. Four types of models were evaluated: a Logistic Regression, a Random Forest, an XGBoost and a Neural Network. As a reference, the company's old model, a Logistic Regression using socioeconomic factors, was used. After careful evaluation using Bayesian Optimization for hyperparameter tuning, a combination of Logistic Regression models was considered to be the best and most consistent. Of the transaction-derived features evaluated, bailiff expenses, salary and ATM withdrawals were considered to be the most influential across all models. Compared to an optimized Logistic Regression base model, a four percentage point average precision improvement could be observed in the final model.},
  author       = {Stålhammar, Jon and Hvarfner, Carl},
  issn         = {1404-6342},
  keyword      = {Machine Learning,Scorecard Modelling,Feature Engineering,Feature Selection,Average Precision Score,SHAP,SMOTE,Logistic Regression,Random Forest,XGBoost,Artificial Neural Network},
  language     = {eng},
  note         = {Student Paper},
  series       = {LUTFMS-3393-2020},
  title        = {Improving High-Risk Consumer Credit Scoring with Financial Transaction Data},
  year         = {2020},
}