Insurance Fraud Detection: Leveraging Positive and Unlabeled Learning in Imbalanced Data
(2024) DABN01 20241
Department of Economics
Department of Statistics
- Abstract
- Insurance fraud data are highly imbalanced, and it can be difficult to define what constitutes fraud within the data. In addition, the labeling of insurance claims occurs as a result of fraud investigations, in which only highly suspicious claims are investigated. As a result, the collected data can be categorized as a positively labeled and unlabeled dataset, where the positively labeled data points represent investigated claims. Although the data can be categorized as positive and unlabeled data, no previous studies within insurance fraud detection have implemented positive and unlabeled learning methods, leaving a sizable gap in the literature. This study aims to examine possible positive and unlabeled learning methods and identify which approaches are suitable for insurance fraud detection. Models are trained under two assumptions: the selected completely at random (SCAR) assumption and the more general selected at random (SAR) assumption. The models trained under the SCAR assumption substantially outperformed the SAR models. The study found that the assumptions underlying SAR are likely violated, causing model performance to deteriorate. In contrast, although the SCAR assumption is also likely violated, model performance remained more robust and did not deteriorate.
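To make the SCAR setup concrete, the sketch below illustrates the classic Elkan and Noto (2008) rescaling approach that much of SCAR-based PU learning builds on. It is not the thesis's own implementation; the synthetic claims data, variable names, and the scikit-learn classifier are illustrative assumptions.

```python
# Minimal sketch of PU learning under the SCAR assumption (Elkan & Noto, 2008).
# Synthetic data and names are assumptions for illustration, not the thesis's setup.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic claims: X are claim features, y_true is the (unobserved) fraud label.
X = rng.normal(size=(5000, 10))
y_true = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=5000) > 2).astype(int)

# Under SCAR, each fraudulent claim is labeled (investigated) with the same
# probability c, independent of its features.
c_true = 0.3
s = ((y_true == 1) & (rng.random(5000) < c_true)).astype(int)  # s = 1: labeled positive

# Step 1: train a "non-traditional" classifier to predict s (labeled vs. unlabeled).
X_tr, X_hold, s_tr, s_hold = train_test_split(X, s, test_size=0.2, random_state=0)
g = LogisticRegression(max_iter=1000).fit(X_tr, s_tr)

# Step 2: estimate c = P(s = 1 | y = 1) as the mean score on held-out labeled
# positives; this estimator is only valid under SCAR.
c_hat = g.predict_proba(X_hold[s_hold == 1])[:, 1].mean()

# Step 3: recover P(y = 1 | x) by rescaling the classifier's scores.
p_fraud = np.clip(g.predict_proba(X)[:, 1] / c_hat, 0.0, 1.0)

print(f"estimated c = {c_hat:.2f} (true c = {c_true})")
```

Under SCAR the label frequency c = P(s = 1 | y = 1) is constant across claims, which is what makes the single rescaling step valid; under SAR, c depends on the claim's features, and this simple correction no longer applies.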
Please use this url to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/9165043
- author
- Svensson, Jacob LU
- supervisor
- Simon Reese LU
- organization
- course
- DABN01 20241
- year
- 2024
- type
- H1 - Master's Degree (One Year)
- subject
- keywords
- Insurance Fraud, PU Learning, Positive and Unlabeled Learning, Insurance Fraud Detection
- language
- English
- id
- 9165043
- date added to LUP
- 2024-09-24 08:36:27
- date last changed
- 2024-09-24 08:36:27
@misc{9165043,
  author   = {{Svensson, Jacob}},
  title    = {{Insurance Fraud Detection: Leveraging Positive and Unlabeled Learning in Imbalanced Data}},
  year     = {{2024}},
  language = {{eng}},
  note     = {{Student Paper}},
  abstract = {{Insurance fraud data are highly imbalanced, and it can be difficult to define what constitutes fraud within the data. In addition, the labeling of insurance claims occurs as a result of fraud investigations, in which only highly suspicious claims are investigated. As a result, the collected data can be categorized as a positively labeled and unlabeled dataset, where the positively labeled data points represent investigated claims. Although the data can be categorized as positive and unlabeled data, no previous studies within insurance fraud detection have implemented positive and unlabeled learning methods, leaving a sizable gap in the literature. This study aims to examine possible positive and unlabeled learning methods and identify which approaches are suitable for insurance fraud detection. Models are trained under two assumptions: the selected completely at random (SCAR) assumption and the more general selected at random (SAR) assumption. The models trained under the SCAR assumption substantially outperformed the SAR models. The study found that the assumptions underlying SAR are likely violated, causing model performance to deteriorate. In contrast, although the SCAR assumption is also likely violated, model performance remained more robust and did not deteriorate.}},
}