Improving High-Risk Consumer Credit Scoring with Financial Transaction Data
(2020) In Master's Theses in Mathematical Sciences FMSM01 20201 Mathematical Statistics
- Abstract
- Credit scoring, the process of evaluating loan applicants with the help of numerical methods, is widely used in the financial sector. When used correctly, it can aid the decision process and lower default rates on loans. New EU regulation has allowed third-party actors to access personal banking information, following the applicant's consent. The increase in available information that financial transaction data provides can be used to improve credit scoring models, and can thus be of great value to lenders. Using a data set from a Swedish consumer institute containing raw transaction data, the authors have tried to extract useful features which can improve credit scoring precision. The authors describe transaction categorization, feature construction and selection, as well as model evaluation. A large number of features were created, which led to a substantial and involved feature reduction process. Four types of models were evaluated: a Logistic Regression, a Random Forest, an XGBoost and a Neural Network. As a reference, the company's old model, a Logistic Regression using socioeconomic factors, was used. After careful evaluation using Bayesian Optimization for hyperparameter tuning, a combination of Logistic Regression models was considered to be the best and most consistent. Of the transaction-derived features evaluated, bailiff expenses, salary and ATM withdrawals were considered to be the most influential across all models. Compared to an optimized Logistic Regression base model, a four percentage point average precision improvement could be observed in the final model.
- Popular Abstract
- Even though machine learning models have been around for decades, they have risen quickly in popularity thanks to large amounts of data and faster computer processors. Credit institutes, and particularly those that give out consumer loans, are among those wanting to utilize machine learning in their day-to-day operations to eliminate subjective opinions and boost their profits. Human-based decisions always inject subjective opinions, which are based on personal experience and are often not accurate. By utilizing automated computer processes, predicting whether a customer will be unable to repay their loan can be done faster and with higher precision. Meanwhile, thanks to new EU regulations, third parties are allowed, with customers' consent, to analyze their financial transactions to produce better services. This thesis connects these two dots, machine learning and financial transactions, to improve what is called credit scoring. Multiple models have been evaluated, and it can be concluded that the more complex models do not outperform a simple statistical model (Logistic Regression). However, this might change as financial transactions are evaluated further and as such data become more widely available.
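
The abstract reports its headline result as an improvement in average precision. As a rough illustration of what that metric measures, the sketch below computes average precision by hand for two sets of scores. It is not taken from the thesis: the function name `average_precision`, the toy labels and both score lists are made up for this example, and tied scores are not given any special handling.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision: sum over ranks k of (R_k - R_{k-1}) * P_k.

    labels: 1 for a defaulted loan (positive class), 0 otherwise.
    scores: model output, higher means more likely to default.
    Tied scores are not grouped, which keeps the sketch short.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    # Rank applicants from highest to lowest predicted risk.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for k, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Recall rises by 1/n_pos at each positive; precision there is hits/k.
            total += hits / k
    return total / n_pos


# Hypothetical toy data: two models scoring the same six applicants.
labels = [1, 0, 1, 0, 0, 0]
base_scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
new_scores = [0.9, 0.4, 0.7, 0.6, 0.2, 0.1]

ap_base = average_precision(labels, base_scores)
ap_new = average_precision(labels, new_scores)
print(f"base AP = {ap_base:.2f}, new AP = {ap_new:.2f}")
```

In this toy case the second model ranks both defaulting applicants first, so its average precision is higher. A difference between two such values, expressed in percentage points, is the kind of comparison the abstract refers to.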
Please use this url to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/9008737
- author
- Stålhammar, Jon LU and Hvarfner, Carl LU
- supervisor
- organization
- course
- FMSM01 20201
- year
- 2020
- type
- H2 - Master's Degree (Two Years)
- subject
- keywords
- Machine Learning, Scorecard Modelling, Feature Engineering, Feature Selection, Average Precision Score, SHAP, SMOTE, Logistic Regression, Random Forest, XGBoost, Artificial Neural Network
- publication/series
- Master's Theses in Mathematical Sciences
- report number
- LUTFMS-3393-2020
- ISSN
- 1404-6342
- other publication id
- 2020:E53
- language
- English
- id
- 9008737
- date added to LUP
- 2020-10-05 13:09:48
- date last changed
- 2021-06-04 17:22:18
@misc{9008737, abstract = {{Credit scoring, the process of evaluating loan applicants with the help of numerical methods, is widely used in the financial sector. When used correctly, it can aid the decision process and lower default rates on loans. New EU regulation has allowed third-party actors to access personal banking information, following the applicant's consent. The increase in available information that financial transaction data provides can be used to improve credit scoring models, and can thus be of great value to lenders. Using a data set from a Swedish consumer institute containing raw transaction data, the authors have tried to extract useful features which can improve credit scoring precision. The authors describe transaction categorization, feature construction and selection as well as model evaluation. A large number of features were created, which lead to a substantial and involved feature reduction process. Four types of models were evaluated: a Logistic Regression, a Random Forest, an XGBoost and a Neural Network. As reference, the company's old model, a Logistic Regression using socioeconomic factors, was used. After careful evaluation using Bayesian Optimization for hyperparameter tuning, a combination of Logistic Regression models were considered to be the best and most consistent. Of the transaction-derived features evaluated, bailiff expenses, salary and ATM withdrawals were considered to be the most influential across all models. Compared to an optimized Logistic Regression base model, a four percentage point average precision improvement could be observed in the final model.}}, author = {{Stålhammar, Jon and Hvarfner, Carl}}, issn = {{1404-6342}}, language = {{eng}}, note = {{Student Paper}}, series = {{Master's Theses in Mathematical Sciences}}, title = {{Improving High-Risk Consumer Credit Scoring with Financial Transaction Data}}, year = {{2020}}, }