Imbalanced Predictions
(2022) STAN40 20221, Department of Statistics
- Abstract
- The aim of the thesis is to evaluate solutions to the class imbalance problem using real-world data sets with varying degrees of class imbalance. The analysis is limited to binary classification. Three large data sets relating to credit card fraud, vehicle insurance and heart disease are used for the analysis.
Several methods are compared and evaluated. Logistic regression, SVC and decision trees are used as benchmark classifiers in order to compare these to imbalanced learning techniques. Random undersampling and SMOTE are used to evaluate resampling techniques. Cost-sensitive versions of logistic regression, SVC and decision trees are used to evaluate cost-sensitive algorithms. The resampling techniques are also used in combination with the cost-sensitive algorithms. The results are evaluated using six measures: accuracy, recall, precision, F-measure, G-mean and AUC.
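To make the methodology concrete, the pipeline described above can be sketched as follows. This is a minimal, hypothetical illustration (not the thesis code), using scikit-learn and a synthetic imbalanced data set: random undersampling of the majority class followed by a logistic regression classifier, scored with the six evaluation measures named in the abstract.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic binary data with roughly 5% positives, standing in for an
# imbalanced set such as the credit card fraud data.
X, y = make_classification(n_samples=4000, weights=[0.95],
                           flip_y=0.01, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Random undersampling: keep every minority case and subsample the
# majority class down to the same size before fitting.
rng = np.random.default_rng(0)
pos = np.flatnonzero(y_tr == 1)
neg = rng.choice(np.flatnonzero(y_tr == 0), size=pos.size, replace=False)
idx = np.concatenate([pos, neg])
clf = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])

pred = clf.predict(X_te)
prob = clf.predict_proba(X_te)[:, 1]
sensitivity = recall_score(y_te, pred)               # recall on the positive class
specificity = recall_score(y_te, pred, pos_label=0)  # recall on the negative class
scores = {
    "accuracy": accuracy_score(y_te, pred),
    "recall": sensitivity,
    "precision": precision_score(y_te, pred),
    "f_measure": f1_score(y_te, pred),
    "g_mean": np.sqrt(sensitivity * specificity),    # geometric mean of the two recalls
    "auc": roc_auc_score(y_te, prob),
}
print(scores)
```

The cost-sensitive variants mentioned in the abstract would, in this sketch, replace the undersampling step with something like `LogisticRegression(class_weight="balanced")` fitted on the full training set; SMOTE would replace the subsampling with synthetic minority oversampling (e.g. via the `imbalanced-learn` package).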
The conclusion of the thesis is that none of the methods evaluated outperforms all others. Depending on the data set used for analysis, the methods produced varying scores for the different evaluation measures. As an example of this, the method used to produce the highest precision score was not the same for the credit card fraud detection data and for the heart disease data. The analysis further showed that which evaluation measure to use depends on the goal of the analysis.
This shows that none of the evaluated techniques is optimal for all data sets. Depending on the data set used and the goals of the analysis, different methods and evaluation measures may be applied.
Please use this URL to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/9089954
- author
- Säfström, Stella
- supervisor
- organization
- alternative title
- An Evaluation of Classification Techniques for Imbalanced Data
- course
- STAN40 20221
- year
- 2022
- type
- H1 - Master's Degree (One Year)
- subject
- keywords
- Imbalanced data, cost-sensitive learning, SMOTE, random undersampling
- language
- English
- id
- 9089954
- date added to LUP
- 2023-02-14 11:47:15
- date last changed
- 2023-02-14 11:47:15
@misc{9089954, abstract = {{The aim of the thesis is to evaluate solutions to the class imbalance problem using real world data sets with varying degrees of class imbalance. The analysis is limited to binary classification. Three large data sets relating to credit card fraud, vehicle insurance and heart disease are used for the analysis. Several methods are compared and evaluated. Logistic regression, SVC and decision trees are used as benchmark classifiers in order to compare these to imbalanced learning techniques. Random undersampling and SMOTE are used to evaluate resampling techniques. Cost-sensitive versions of logistic regression, SVC and decision trees are used to evaluate cost-sensitive algorithms. The resampling techniques are also used in combination with the cost-sensitive algorithms. The results are evaluated using six measures: accuracy, recall, precision, F-measure, G-mean and AUC. The conclusion of the thesis is that none of the methods evaluated outperforms all others. Depending on the data set used for analysis, the methods produced varying scores for the different evaluation measures. As an example of this, the method used to produce the highest precision score was not the same for the credit card fraud detection data and for the heart disease data. The analysis further showed that which evaluation measure to use depends on the goal of the analysis. This shows that none of the evaluated techniques are optimal for all data sets. Depending on the data set used and the goals of the analysis, different methods and evaluation measures may be applied.}}, author = {{Säfström, Stella}}, language = {{eng}}, note = {{Student Paper}}, title = {{Imbalanced Predictions}}, year = {{2022}}, }