Improving High-Risk Consumer Credit Scoring with Financial Transaction Data

Stålhammar, Jon; Hvarfner, Carl

Stålhammar, Jon and Hvarfner, Carl (2020) In Master's Theses in Mathematical Sciences FMSM01 20201
Mathematical Statistics

Abstract: Credit scoring, the process of evaluating loan applicants with the help of numerical methods, is widely used in the financial sector. When used correctly, it can aid the decision process and lower default rates on loans. New EU regulation has allowed third-party actors to access personal banking information, following the applicant's consent. The increase in available information that financial transaction data provides can be used to improve credit scoring models, and can thus be of great value to lenders. Using a data set from a Swedish consumer institute containing raw transaction data, the authors have tried to extract useful features which can improve credit scoring precision. The authors describe transaction categorization, feature construction and selection, as well as model evaluation. A large number of features were created, which led to a substantial and involved feature reduction process. Four types of models were evaluated: a Logistic Regression, a Random Forest, an XGBoost and a Neural Network. As a reference, the company's old model, a Logistic Regression using socioeconomic factors, was used. After careful evaluation using Bayesian Optimization for hyperparameter tuning, a combination of Logistic Regression models was considered to be the best and most consistent. Of the transaction-derived features evaluated, bailiff expenses, salary and ATM withdrawals were considered to be the most influential across all models. Compared to an optimized Logistic Regression base model, a four percentage point average precision improvement could be observed in the final model.
Popular Abstract: Even though machine learning models have been around for decades, they have risen quickly in popularity thanks to large amounts of data and faster computer processors. Credit institutes, particularly those that give out consumer loans, are among those wanting to use machine learning in their day-to-day operations to eliminate subjective opinions and boost their profits. Human-based decisions always inject subjective opinions, which are based on personal experience and are often not true. With automated computer processes, predicting whether a customer will be unable to repay their loan can be done faster and with higher precision. Meanwhile, thanks to new EU regulations, third parties are allowed, with customers' consent, to analyze their financial transactions to produce better services. This thesis connects these two dots, machine learning and financial transactions, to improve what is called credit scoring. Multiple models have been evaluated, and it can be concluded that the more complex models do not outperform a simple statistical model (Logistic Regression). However, this might change as financial transactions are evaluated further and such data becomes more widely available.

Please use this url to cite or link to this publication: http://lup.lub.lu.se/student-papers/record/9008737

author: Stålhammar, Jon and Hvarfner, Carl
supervisor: Erik Lindström
organization: Mathematical Statistics
course: FMSM01 20201
year: 2020
type: H2 - Master's Degree (Two Years)
subject: Mathematics and Statistics
keywords: Machine Learning, Scorecard Modelling, Feature Engineering, Feature Selection, Average Precision Score, SHAP, SMOTE, Logistic Regression, Random Forest, XGBoost, Artificial Neural Network
publication/series: Master's Theses in Mathematical Sciences
report number: LUTFMS-3393-2020
ISSN: 1404-6342
other publication id: 2020:E53
language: English
id: 9008737
date added to LUP: 2020-10-05 13:09:48
date last changed: 2021-06-04 17:22:18

@misc{9008737,
  abstract     = {{Credit scoring, the process of evaluating loan applicants with the help of numerical methods, is widely used in the financial sector. When used correctly, it can aid the decision process and lower default rates on loans. New EU regulation has allowed third-party actors to access personal banking information, following the applicant's consent. The increase in available information that financial transaction data provides can be used to improve credit scoring models, and can thus be of great value to lenders. Using a data set from a Swedish consumer institute containing raw transaction data, the authors have tried to extract useful features which can improve credit scoring precision. The authors describe transaction categorization, feature construction and selection, as well as model evaluation. A large number of features were created, which led to a substantial and involved feature reduction process. Four types of models were evaluated: a Logistic Regression, a Random Forest, an XGBoost and a Neural Network. As a reference, the company's old model, a Logistic Regression using socioeconomic factors, was used. After careful evaluation using Bayesian Optimization for hyperparameter tuning, a combination of Logistic Regression models was considered to be the best and most consistent. Of the transaction-derived features evaluated, bailiff expenses, salary and ATM withdrawals were considered to be the most influential across all models. Compared to an optimized Logistic Regression base model, a four percentage point average precision improvement could be observed in the final model.}},
  author       = {{Stålhammar, Jon and Hvarfner, Carl}},
  issn         = {{1404-6342}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{Master's Theses in Mathematical Sciences}},
  title        = {{Improving High-Risk Consumer Credit Scoring with Financial Transaction Data}},
  year         = {{2020}},
}

LUP Student Papers

LUND UNIVERSITY LIBRARIES
