Imbalanced Predictions
(2022) STAN40 20221, Department of Statistics
- Abstract
- The aim of the thesis is to evaluate solutions to the class imbalance problem using real-world data sets with varying degrees of class imbalance. The analysis is limited to binary classification. Three large data sets relating to credit card fraud, vehicle insurance and heart disease are used for the analysis.
Several methods are compared and evaluated. Logistic regression, SVC and decision trees are used as benchmark classifiers in order to compare these to imbalanced learning techniques. Random undersampling and SMOTE are used to evaluate resampling techniques. Cost-sensitive versions of logistic regression, SVC and decision trees are used to evaluate cost-sensitive algorithms. The resampling techniques are also used in combination with the cost-sensitive algorithms. The results are evaluated using six measures: accuracy, recall, precision, F-measure, G-mean and AUC.
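To make the methodology concrete, the pipeline described above can be sketched as follows. This is a minimal, hypothetical illustration (not the thesis code), using scikit-learn and a synthetic imbalanced data set: random undersampling of the majority class followed by a logistic regression classifier, scored with the six evaluation measures named in the abstract.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic binary data with roughly 5% positives, standing in for an
# imbalanced set such as the credit card fraud data.
X, y = make_classification(n_samples=4000, weights=[0.95],
                           flip_y=0.01, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Random undersampling: keep every minority case and subsample the
# majority class down to the same size before fitting.
rng = np.random.default_rng(0)
pos = np.flatnonzero(y_tr == 1)
neg = rng.choice(np.flatnonzero(y_tr == 0), size=pos.size, replace=False)
idx = np.concatenate([pos, neg])
clf = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])

pred = clf.predict(X_te)
prob = clf.predict_proba(X_te)[:, 1]
sensitivity = recall_score(y_te, pred)               # recall on the positive class
specificity = recall_score(y_te, pred, pos_label=0)  # recall on the negative class
scores = {
    "accuracy": accuracy_score(y_te, pred),
    "recall": sensitivity,
    "precision": precision_score(y_te, pred),
    "f_measure": f1_score(y_te, pred),
    "g_mean": np.sqrt(sensitivity * specificity),    # geometric mean of the two recalls
    "auc": roc_auc_score(y_te, prob),
}
print(scores)
```

The cost-sensitive variants mentioned in the abstract would, in this sketch, replace the undersampling step with something like `LogisticRegression(class_weight="balanced")` fitted on the full training set; SMOTE would replace the subsampling with synthetic minority oversampling (e.g. via the `imbalanced-learn` package).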
The conclusion of the thesis is that none of the methods evaluated outperforms all others. Depending on the data set used for analysis, the methods produced varying scores for the different evaluation measures. As an example of this, the method used to produce the highest precision score was not the same for the credit card fraud detection data and for the heart disease data. The analysis further showed that which evaluation measure to use depends on the goal of the analysis.
This shows that none of the evaluated techniques is optimal for all data sets. Depending on the data set used and the goals of the analysis, different methods and evaluation measures may be applied.
Please use this URL to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/9089954
- author
- Säfström, Stella
- supervisor
- organization
- alternative title
- An Evaluation of Classification Techniques for Imbalanced Data
- course
- STAN40 20221
- year
- 2022
- type
- H1 - Master's Degree (One Year)
- subject
- keywords
- Imbalanced data, cost-sensitive learning, SMOTE, random undersampling
- language
- English
- id
- 9089954
- date added to LUP
- 2023-02-14 11:47:15
- date last changed
- 2023-02-14 11:47:15
@misc{9089954, abstract = {{The aim of the thesis is to evaluate solutions to the class imbalance problem using real world data sets with varying degrees of class imbalance. The analysis is limited to binary classification. Three large data sets relating to credit card fraud, vehicle insurance and heart disease are used for the analysis. Several methods are compared and evaluated. Logistic regression, SVC and decision trees are used as benchmark classifiers in order to compare these to imbalanced learning techniques. Random undersampling and SMOTE are used to evaluate resampling techniques. Cost-sensitive versions of logistic regression, SVC and decision trees are used to evaluate cost-sensitive algorithms. The resampling techniques are also used in combination with the cost-sensitive algorithms. The results are evaluated using six measures: accuracy, recall, precision, F-measure, G-mean and AUC. The conclusion of the thesis is that none of the methods evaluated outperforms all others. Depending on the data set used for analysis, the methods produced varying scores for the different evaluation measures. As an example of this, the method used to produce the highest precision score was not the same for the credit card fraud detection data and for the heart disease data. The analysis further showed that which evaluation measure to use depends on the goal of the analysis. This shows that none of the evaluated techniques are optimal for all data sets. Depending on the data set used and the goals of the analysis, different methods and evaluation measures may be applied.}}, author = {{Säfström, Stella}}, language = {{eng}}, note = {{Student Paper}}, title = {{Imbalanced Predictions}}, year = {{2022}}, }