Detection of insurance fraud using NLP and ML
(2023) In Master's Theses in Mathematical Sciences, FMSM01 20231, Mathematical Statistics
- Abstract
- Machine learning can sometimes see things that we as humans cannot. In this thesis we evaluated three different Natural Language Processing (NLP) techniques, BERT, word2vec and linguistic analysis (UDPipe), on their performance in detecting insurance fraud based on transcribed audio from phone calls (referred to as audio data) and written text (referred to as text-form data) related to insurance claims. We also included TF-IDF as a naive baseline. For all models we applied logistic regression to the extracted word embeddings. For word2vec and the linguistic analysis we also applied a KNN classifier to the embeddings. For BERT we instead applied an LSTM network to the merged CLS-token embeddings, owing to the sequential nature of BERT's architecture.
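As an illustration only (the thesis does not publish its code), the TF-IDF baseline with a logistic-regression classifier, plus the KNN variant used for the embedding models, could be sketched roughly as follows; the claim texts and labels here are invented stand-ins, not the thesis data:

```python
# Hypothetical sketch of the TF-IDF baseline pipeline. The claim texts and
# labels are invented stand-ins for the (confidential) insurance data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Invented claim texts with fraud labels (1 = fraud, 0 = legitimate).
texts = [
    "my phone was stolen on the bus yesterday",
    "the water damage appeared after the storm",
    "I lost my brand new laptop again last month",
    "a tree fell on the garage roof during the night",
]
labels = [1, 0, 1, 0]

# TF-IDF turns each text into a sparse, term-weighted bag-of-words vector.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Logistic regression on the extracted features, as applied to all models.
clf = LogisticRegression().fit(X, labels)

# A KNN classifier on the same vectors (used for word2vec/UDPipe features).
knn = KNeighborsClassifier(n_neighbors=1).fit(X, labels)

print(clf.predict(X))  # in-sample predictions, illustration only
print(knn.predict(X))
```

In the thesis the features fed to the classifiers differ per model (TF-IDF vectors, word2vec or UDPipe embeddings, BERT CLS tokens); only the classification step is common.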
For the audio data, every model achieved, with at least one classifier, a Macro F1-score whose 95% confidence interval lay above 50%. TF-IDF scored 58.2% ±2.6%, BERT 56.0% ±2.6%, word2vec 54.1% ±3.8% and linguistic analysis 53.6% ±3.0%.
For the text-form data, every model likewise achieved, with at least one classifier, a Macro F1-score whose 95% confidence interval lay above 50%. TF-IDF scored 56.0% ±2.3%, BERT 57.4% ±0.9%, word2vec 56.0% ±2.1% and linguistic analysis 51.4% ±0.5%.
Each reported score uses the best-performing classifier for that model.
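For reference, Macro F1 is the unweighted mean of the per-class F1 scores, and an interval of the "score ± half-width" form above can be obtained, for example, from repeated evaluation runs. A minimal sketch with invented numbers (not the thesis data or its exact interval method) might look like:

```python
# Sketch of the Macro F1 metric and a normal-approximation 95% interval
# over repeated runs. All labels, predictions and run scores are invented.
from statistics import mean, stdev

from sklearn.metrics import f1_score

# Invented labels and predictions for a single run, illustration only.
y_true = [1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]

# Macro F1: the unweighted mean of the per-class F1 scores.
macro_f1 = f1_score(y_true, y_pred, average="macro")

# Given Macro F1 scores from several runs (invented numbers), a rough
# 95% interval is mean ± 1.96 * standard error of the run scores.
run_scores = [0.58, 0.55, 0.60, 0.57, 0.59]
half_width = 1.96 * stdev(run_scores) / len(run_scores) ** 0.5
print(f"{mean(run_scores):.3f} ± {half_width:.3f}")
```

Unlike plain accuracy, Macro F1 weights both classes equally, which matters here because fraudulent claims are typically the minority class.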
The above findings show that our models manage to learn something from the data, but because the data sets are rather small and the insurance cases come from many different areas, it is difficult to draw any conclusions with high confidence. The results are not much better than guessing, and the small gain over 50% could stem from something else, such as bias in the data sets.
We feel that there is potential to use these techniques in a real setting, but the topic needs further exploration. We see particular potential in transformer-based models such as BERT, but they currently lack the ability to analyse longer sequences due to computational limitations. Given the current pace of transformer development, it may become possible to use such models to obtain a better representation of what is being said, which would hopefully produce better results. - Popular Abstract (Swedish)
- Insurance companies need to identify and handle a large number of potentially fraudulent insurance claims from their customers. They therefore maintain a range of systems for determining which claims are fraudulent and which are not. With the latest developments in machine learning, and in particular in natural language processing (NLP), these techniques could potentially be used to facilitate the identification of fraudulent claims.
The purpose of this thesis is therefore to investigate the possibility of using NLP for the purpose described above. Based on transcribed calls between customers and company representatives, and on text written freely by customers or as answers to various questions (depending on the claim type), we train several models to predict the probability that a claim constitutes insurance fraud. The three models used in this project are BERT, word2vec and linguistic analysis.
The models create numerical representations of these texts, which are then classified using various machine-learning techniques. Finally, the performance of each model is evaluated to determine whether this is a useful aid in the process of investigating potentially fraudulent claims.
Please use this url to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/9131078
- author
- Bäcklund, Rasmus and Öhman, Hampus
- supervisor
- organization
- alternative title
- Detektion av försäkringsbedrägeri genom användning av NLP och ML
- course
- FMSM01 20231
- year
- 2023
- type
- H2 - Master's Degree (Two Years)
- subject
- publication/series
- Master's Theses in Mathematical Sciences
- report number
- LUTFMS-3489-2023
- ISSN
- 1404-6342
- other publication id
- 2023:E64
- language
- English
- id
- 9131078
- date added to LUP
- 2023-06-28 15:23:50
- date last changed
- 2023-07-03 14:11:17
@misc{9131078, abstract = {{Machine learning can sometimes see things that we as humans cannot. In this thesis we evaluated three different Natural Language Processing (NLP) techniques, BERT, word2vec and linguistic analysis (UDPipe), on their performance in detecting insurance fraud based on transcribed audio from phone calls (referred to as audio data) and written text (referred to as text-form data) related to insurance claims. We also included TF-IDF as a naive baseline. For all models we applied logistic regression to the extracted word embeddings. For word2vec and the linguistic analysis we also applied a KNN classifier to the embeddings. For BERT we instead applied an LSTM network to the merged CLS-token embeddings, owing to the sequential nature of BERT's architecture. For the audio data, every model achieved, with at least one classifier, a Macro F1-score whose 95% confidence interval lay above 50%. TF-IDF scored 58.2% ±2.6%, BERT 56.0% ±2.6%, word2vec 54.1% ±3.8% and linguistic analysis 53.6% ±3.0%. For the text-form data, every model likewise achieved, with at least one classifier, a Macro F1-score whose 95% confidence interval lay above 50%. TF-IDF scored 56.0% ±2.3%, BERT 57.4% ±0.9%, word2vec 56.0% ±2.1% and linguistic analysis 51.4% ±0.5%. Each reported score uses the best-performing classifier for that model. The above findings show that our models manage to learn something from the data, but because the data sets are rather small and the insurance cases come from many different areas, it is difficult to draw any conclusions with high confidence. The results are not much better than guessing, and the small gain over 50% could stem from something else, such as bias in the data sets. We feel that there is potential to use these techniques in a real setting, but the topic needs further exploration.
We see particular potential in transformer-based models such as BERT, but they currently lack the ability to analyse longer sequences due to computational limitations. Given the current pace of transformer development, it may become possible to use such models to obtain a better representation of what is being said, which would hopefully produce better results.}}, author = {{Bäcklund, Rasmus and Öhman, Hampus}}, issn = {{1404-6342}}, language = {{eng}}, note = {{Student Paper}}, series = {{Master's Theses in Mathematical Sciences}}, title = {{Detection of insurance fraud using NLP and ML}}, year = {{2023}}, }