LUP Student Papers

LUND UNIVERSITY LIBRARIES

Detection of insurance fraud using NLP and ML

Bäcklund, Rasmus LU and Öhman, Hampus (2023) In Master's Theses in Mathematical Sciences FMSM01 20231
Mathematical Statistics
Abstract
Machine learning can sometimes see things that we as humans cannot. In this thesis we evaluated three different natural language processing (NLP) techniques, BERT, word2vec and linguistic analysis (UDPipe), on their performance in detecting insurance fraud based on transcribed audio from phone calls (referred to as audio data) and written text (referred to as text-form data) related to insurance claims. We also included TF-IDF as a naive model. For all models we applied logistic regression to the extracted word embeddings. For word2vec and the linguistic analysis, we also applied a KNN classifier to the word embeddings. For BERT, we instead opted to apply an LSTM network to the merged CLS-token embeddings, due to the sequential nature of BERT’s architecture.
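
As a rough illustration of the TF-IDF representation mentioned above, the following minimal sketch turns a few texts into TF-IDF vectors using only the Python standard library. The example claims, the whitespace tokenisation, the function name `tfidf_vectors` and the smoothed IDF formula are assumptions made for illustration; the abstract does not specify how the thesis computed its TF-IDF features.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return the sorted vocabulary and one TF-IDF vector per document."""
    # Naive tokenisation: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))

    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {
        term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
        for term in vocabulary
    }

    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        length = len(doc) or 1  # guard against empty documents
        # Term frequency (relative count) times inverse document frequency.
        vectors.append([counts[t] / length * idf[t] for t in vocabulary])
    return vocabulary, vectors


# Hypothetical toy claims, not taken from the thesis data.
claims = [
    "phone stolen on the train",
    "phone dropped in the lake",
    "bicycle stolen outside the store",
]
vocab, vecs = tfidf_vectors(claims)
```

Each resulting vector could then serve as input to a classifier such as logistic regression. Terms that occur in every document (here "the") receive a lower weight than terms that occur in only one.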

For the audio data, all models achieved a Macro F1-score whose 95% confidence interval lies above 50%, with at least one type of classifier. TF-IDF scored 58.2% ±2.6%, BERT 56.0% ±2.6%, word2vec 54.1% ±3.8% and linguistic analysis 53.6% ±3.0%.

For the text-form data, all models achieved a Macro F1-score whose 95% confidence interval lies above 50%, with at least one type of classifier. TF-IDF scored 56.0% ±2.3%, BERT 57.4% ±0.9%, word2vec 56.0% ±2.1% and linguistic analysis 51.4% ±0.5%.

Each score reported is from using the best performing classifier for that model.
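
For readers unfamiliar with the metric, the sketch below shows how a Macro F1-score (the unweighted mean of the per-class F1-scores) can be computed, and one possible way of attaching a "±" half-width to a set of scores. The abstract does not say how the thesis derived its confidence intervals, so the normal approximation over repeated runs, the function names and the example labels and scores are all assumptions for illustration.

```python
import math
import statistics


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1-scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mean_with_interval(scores, z=1.96):
    """Mean and half-width of a normal-approximation 95% interval (assumed method)."""
    mean = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, half_width


# Hypothetical labels: 1 = fraudulent claim, 0 = legitimate claim.
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
score = macro_f1(y_true, y_pred)

# Hypothetical Macro F1-scores from five repeated runs.
run_scores = [0.55, 0.57, 0.56, 0.58, 0.54]
mean, half_width = mean_with_interval(run_scores)
```

Because every class counts equally regardless of its size, Macro F1 does not reward a classifier for simply predicting the majority class, which makes it a common choice when fraudulent cases are rare.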

The above findings show that our models manage to learn something from the data, but due to rather small data sets and insurance cases from many different areas, it is quite difficult to draw any conclusions with high confidence. The results are not that much better than "guessing", and the small gain over 50% could be due to something else, such as bias in the data sets.

We feel that there is potential to use these techniques in a real setting, but the topic seems to need further exploration. We especially feel that there is potential in using transformer-based models, such as BERT, but such models currently lack the ability to analyse longer sequences due to computational limitations. With the current development pace of transformer models, it might be possible to use these in the future to get a better representation of what is being said, which hopefully would produce better results.
Popular Abstract (Swedish)
Insurance companies need to identify and handle a large number of potentially fraudulent insurance claims from their customers. They therefore have a range of systems for trying to identify which claims are fraudulent and which are not. With recent developments in machine learning, and particularly in natural language processing (NLP), these techniques could potentially be used to facilitate the identification of fraudulent claims.

The purpose of this thesis is therefore to investigate the possibility of using NLP for the above-mentioned purpose. Based on transcribed calls between customers and company representatives, as well as on text written freely by customers or in response to various questions (depending on the claim type), we will train several models to predict the probability that a claim is insurance fraud. The three different models used in this project are BERT, word2vec and linguistic analysis.

The models will create numerical representations of these texts, which will then be classified using various machine learning techniques. Finally, the performance of each model will be evaluated to determine whether this is a useful aid in the process of investigating potentially fraudulent claims.
author: Bäcklund, Rasmus LU and Öhman, Hampus
alternative title: Detektion av försäkringsbedrägeri genom användning av NLP och ML
course: FMSM01 20231
year: 2023
type: H2 - Master's Degree (Two Years)
publication/series: Master's Theses in Mathematical Sciences
report number: LUTFMS-3489-2023
ISSN: 1404-6342
other publication id: 2023:E64
language: English
id: 9131078
date added to LUP: 2023-06-28 15:23:50
date last changed: 2023-07-03 14:11:17
@misc{9131078,
  abstract     = {{Machine learning can sometimes see things that we as humans cannot. In this thesis we evaluated three different natural language processing (NLP) techniques, BERT, word2vec and linguistic analysis (UDPipe), on their performance in detecting insurance fraud based on transcribed audio from phone calls (referred to as audio data) and written text (referred to as text-form data) related to insurance claims. We also included TF-IDF as a naive model. For all models we applied logistic regression to the extracted word embeddings. For word2vec and the linguistic analysis, we also applied a KNN classifier to the word embeddings. For BERT, we instead opted to apply an LSTM network to the merged CLS-token embeddings, due to the sequential nature of BERT’s architecture.

For the audio data, all models achieved a Macro F1-score whose 95% confidence interval lies above 50%, with at least one type of classifier. TF-IDF scored 58.2% ±2.6%, BERT 56.0% ±2.6%, word2vec 54.1% ±3.8% and linguistic analysis 53.6% ±3.0%.

For the text-form data, all models achieved a Macro F1-score whose 95% confidence interval lies above 50%, with at least one type of classifier. TF-IDF scored 56.0% ±2.3%, BERT 57.4% ±0.9%, word2vec 56.0% ±2.1% and linguistic analysis 51.4% ±0.5%.

Each score reported is from using the best performing classifier for that model.

The above findings show that our models manage to learn something from the data, but due to rather small data sets and insurance cases from many different areas, it is quite difficult to draw any conclusions with high confidence. The results are not that much better than "guessing", and the small gain over 50% could be due to something else, such as bias in the data sets.

We feel that there is potential to use these techniques in a real setting, but the topic seems to need further exploration. We especially feel that there is potential in using transformer-based models, such as BERT, but such models currently lack the ability to analyse longer sequences due to computational limitations. With the current development pace of transformer models, it might be possible to use these in the future to get a better representation of what is being said, which hopefully would produce better results.}},
  author       = {{Bäcklund, Rasmus and Öhman, Hampus}},
  issn         = {{1404-6342}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{Master's Theses in Mathematical Sciences}},
  title        = {{Detection of insurance fraud using NLP and ML}},
  year         = {{2023}},
}