Insurance Fraud Detection: Leveraging Positive and Unlabeled Learning in Imbalanced Data
(2024) DABN01 20241
Department of Economics
Department of Statistics
- Abstract
- Insurance fraud data are highly imbalanced, and it can be difficult to define what constitutes fraud within the data. In addition, the labeling of insurance claims occurs as a result of fraud investigations, in which only highly suspicious claims are investigated. As a result, the collected data can be categorized as a positively labeled and unlabeled dataset, where the positively labeled data points represent investigated claims. Although the data can be categorized as positive and unlabeled data, no previous studies within insurance fraud detection have implemented positive and unlabeled learning methods, leaving a sizable gap in the literature. This study aims to examine possible positive and unlabeled learning methods and identify which approaches are suitable for insurance fraud detection. Models are trained under two assumptions: the selected completely at random (SCAR) assumption and the more general selected at random (SAR) assumption. The models trained under the SCAR assumption substantially outperformed the SAR models. The study found that the assumptions underlying SAR are likely violated, causing model performance to deteriorate. In contrast, although the SCAR assumption is also likely violated, model performance remained more robust and did not deteriorate.
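To make the SCAR setup concrete, the sketch below illustrates the classic Elkan and Noto (2008) rescaling approach that much of SCAR-based PU learning builds on. It is not the thesis's own implementation; the synthetic claims data, variable names, and the scikit-learn classifier are illustrative assumptions.

```python
# Minimal sketch of PU learning under the SCAR assumption (Elkan & Noto, 2008).
# Synthetic data and names are assumptions for illustration, not the thesis's setup.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic claims: X are claim features, y_true is the (unobserved) fraud label.
X = rng.normal(size=(5000, 10))
y_true = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=5000) > 2).astype(int)

# Under SCAR, each fraudulent claim is labeled (investigated) with the same
# probability c, independent of its features.
c_true = 0.3
s = ((y_true == 1) & (rng.random(5000) < c_true)).astype(int)  # s = 1: labeled positive

# Step 1: train a "non-traditional" classifier to predict s (labeled vs. unlabeled).
X_tr, X_hold, s_tr, s_hold = train_test_split(X, s, test_size=0.2, random_state=0)
g = LogisticRegression(max_iter=1000).fit(X_tr, s_tr)

# Step 2: estimate c = P(s = 1 | y = 1) as the mean score on held-out labeled
# positives; this estimator is only valid under SCAR.
c_hat = g.predict_proba(X_hold[s_hold == 1])[:, 1].mean()

# Step 3: recover P(y = 1 | x) by rescaling the classifier's scores.
p_fraud = np.clip(g.predict_proba(X)[:, 1] / c_hat, 0.0, 1.0)

print(f"estimated c = {c_hat:.2f} (true c = {c_true})")
```

Under SCAR the label frequency c = P(s = 1 | y = 1) is constant across claims, which is what makes the single rescaling step valid; under SAR, c depends on the claim's features, and this simple correction no longer applies.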
Please use this url to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/9165043
- author
- Svensson, Jacob LU
- supervisor
- Simon Reese LU
- organization
- course
- DABN01 20241
- year
- 2024
- type
- H1 - Master's Degree (One Year)
- subject
- keywords
- Insurance Fraud, PU Learning, Positive and Unlabeled Learning, Insurance Fraud Detection
- language
- English
- id
- 9165043
- date added to LUP
- 2024-09-24 08:36:27
- date last changed
- 2024-09-24 08:36:27
@misc{9165043,
  author   = {{Svensson, Jacob}},
  title    = {{Insurance Fraud Detection: Leveraging Positive and Unlabeled Learning in Imbalanced Data}},
  year     = {{2024}},
  language = {{eng}},
  note     = {{Student Paper}},
  abstract = {{Insurance fraud data are highly imbalanced, and it can be difficult to define what constitutes fraud within the data. In addition, the labeling of insurance claims occurs as a result of fraud investigations, in which only highly suspicious claims are investigated. As a result, the collected data can be categorized as a positively labeled and unlabeled dataset, where the positively labeled data points represent investigated claims. Although the data can be categorized as positive and unlabeled data, no previous studies within insurance fraud detection have implemented positive and unlabeled learning methods, leaving a sizable gap in the literature. This study aims to examine possible positive and unlabeled learning methods and identify which approaches are suitable for insurance fraud detection. Models are trained under two assumptions: the selected completely at random (SCAR) assumption and the more general selected at random (SAR) assumption. The models trained under the SCAR assumption substantially outperformed the SAR models. The study found that the assumptions underlying SAR are likely violated, causing model performance to deteriorate. In contrast, although the SCAR assumption is also likely violated, model performance remained more robust and did not deteriorate.}},
}