
LUP Student Papers

LUND UNIVERSITY LIBRARIES

Insurance Fraud Detection: Leveraging Positive and Unlabeled Learning in Imbalanced Data

Svensson, Jacob LU (2024) DABN01 20241
Department of Economics
Department of Statistics
Abstract
Insurance fraud data is highly imbalanced, and it can be difficult to define what constitutes fraud within the data. In addition, insurance claims are labeled as a result of fraud investigations, where only highly suspicious claims are investigated. As a result, the collected data can be categorized as a positive and unlabeled dataset, where the positively labeled data points represent investigated claims. Although the data can be categorized this way, no previous studies within insurance fraud detection have implemented positive and unlabeled (PU) learning methods, leaving a sizable gap in the literature. This study examines possible PU learning methods and identifies which approaches are suitable for insurance fraud detection. Models are trained under two assumptions: the selected completely at random (SCAR) assumption and the more general selected at random (SAR) assumption. The models trained under the SCAR assumption substantially outperformed the SAR models. The study found that the assumptions required under SAR are likely violated, causing model performance to deteriorate. In contrast, although the SCAR assumption is also likely violated, model performance was more robust and did not deteriorate.
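
The SCAR setting named in the abstract has a standard operational form: if each fraudulent claim is investigated (labeled) with the same probability c = P(s=1 | y=1) regardless of its features, then P(s=1 | x) = c * P(y=1 | x), so a classifier trained to separate labeled from unlabeled claims can be rescaled by an estimate of c to yield fraud probabilities. The sketch below illustrates that idea (an Elkan-Noto style adjustment) on synthetic data with scikit-learn; the data, the random forest, the variable names, and the value of c are assumptions for illustration only, not the models or data used in the thesis.

# Minimal sketch of PU learning under the SCAR assumption.
# Illustrative only; the synthetic data, model choice, and c are assumed.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic, imbalanced "claims" data: y = 1 marks true fraud.
X, y = make_classification(
    n_samples=20_000, n_features=20, weights=[0.95, 0.05], random_state=0
)

# SCAR labeling: each true fraud case is investigated (labeled) with the
# same probability c, independent of its features x.
c_true = 0.3
s = np.where((y == 1) & (rng.random(len(y)) < c_true), 1, 0)

# Step 1: train a classifier g(x) ~ P(s = 1 | x) on labeled-vs-unlabeled data.
X_tr, X_val, s_tr, s_val = train_test_split(X, s, test_size=0.25, random_state=0)
g = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, s_tr)

# Step 2: estimate c = P(s = 1 | y = 1) as the mean score of g on
# held-out labeled positives.
c_hat = g.predict_proba(X_val[s_val == 1])[:, 1].mean()

# Step 3: under SCAR, P(y = 1 | x) = P(s = 1 | x) / c, so calibrated
# fraud scores are g(x) / c_hat (clipped to the [0, 1] range).
fraud_scores = np.clip(g.predict_proba(X_val)[:, 1] / c_hat, 0, 1)
print(f"estimated c = {c_hat:.2f}, true c = {c_true}")

Under SAR, the probability of being investigated depends on the claim's features, so this simple rescaling by a single constant no longer applies; that distinction between the two labeling mechanisms is what the thesis evaluates.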
author: Svensson, Jacob LU
supervisor:
organization:
course: DABN01 20241
year: 2024
type: H1 - Master's Degree (One Year)
subject:
keywords: Insurance Fraud, PU Learning, Positive and Unlabeled Learning, Insurance Fraud Detection
language: English
id: 9165043
date added to LUP: 2024-09-24 08:36:27
date last changed: 2024-09-24 08:36:27
@misc{9165043,
  abstract     = {{Insurance fraud data is highly imbalanced, and it can be difficult to define what constitutes fraud within the data. In addition, insurance claims are labeled as a result of fraud investigations, where only highly suspicious claims are investigated. As a result, the collected data can be categorized as a positive and unlabeled dataset, where the positively labeled data points represent investigated claims. Although the data can be categorized this way, no previous studies within insurance fraud detection have implemented positive and unlabeled (PU) learning methods, leaving a sizable gap in the literature. This study examines possible PU learning methods and identifies which approaches are suitable for insurance fraud detection. Models are trained under two assumptions: the selected completely at random (SCAR) assumption and the more general selected at random (SAR) assumption. The models trained under the SCAR assumption substantially outperformed the SAR models. The study found that the assumptions required under SAR are likely violated, causing model performance to deteriorate. In contrast, although the SCAR assumption is also likely violated, model performance was more robust and did not deteriorate.}},
  author       = {{Svensson, Jacob}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Insurance Fraud Detection: Leveraging Positive and Unlabeled Learning in Imbalanced Data}},
  year         = {{2024}},
}