Lost in Translation: AI’s Struggles with Scanian – A Study on Language Models’ Attempts to Conquer Swedish and its Dialects
(2025) ALSK13 20242
Division of Linguistics and Cognitive Semiotics
General Linguistics
- Abstract
- The recent surge in the development of artificial intelligence has greatly improved the accuracy of automatic speech recognition. However, as with any rapid technical advance, it has weaknesses, in this case a lack of dialectal training data. This thesis investigates whether AI has an unevenly distributed understanding of regional speech, focusing on the neglect of the southern Swedish dialects spoken in the region of Scania. It compares the word error rate of AI-generated transcriptions of spontaneous standard Swedish speech with that of AI-generated transcriptions of spontaneous Scanian speech, and analyzes the errors further to identify potential reasons why Scanian could be hard for AI to understand. The results show that AI is significantly worse at understanding regional Scanian dialects than standard Swedish. The study highlights how AI is disproportionately trained on skewed data that favors standardized speech.
- Abstract (Swedish)
- Under de senaste åren har utvecklingen av artificiell intelligens frodats och med sig tagit en förbättrad träffsäkerhet i taligenkänning. Däremot, liksom med alla snabba teknologiska utvecklingar har den sina brister, i det här fallet bristen på dialektal träningsdata. Den här uppsatsen undersöker om AI har en ojämnt fördelad förståelse av regionalt språk, med fokus på försummandet av de sydsvenska dialekterna i Skåne. Detta görs genom att jämföra ordfelsfrekvensen i AI-genererade transkriptioner av naturligt rikssvenskt tal och naturligt skånskt tal, och vidare analyserad för att hitta potentiella orsaker till varför det är så svårt för AI att förstå skånska. Resultaten visar att AI är signifikant sämre på att förstå skånska dialekter jämfört med rikssvenska. Den här studien lyfter fram hur AI är oproportionellt tränad med ojämnt fördelad data, vilket får AI att föredra standardiseringen av språk.
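The abstract's central measure, word error rate (WER), is conventionally the number of word-level substitutions, deletions, and insertions needed to turn the automatic transcript into the reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration of that standard definition, not the thesis's own implementation: the function name `word_error_rate`, the lowercasing and whitespace tokenization, and the example sentences are all assumptions made here for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Hypothetical helper: tokenization (lowercase, split on whitespace)
    is an assumption, not taken from the thesis.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions against an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Invented example sentences (not thesis data): one dropped word out of four.
wer = word_error_rate("jag bor i skåne", "jag bor skåne")
print(wer)
```

Comparing dialects then amounts to computing this ratio per recording and contrasting the Scanian and standard Swedish groups. Note that WER can exceed 1.0 when the hypothesis contains many inserted words.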
Please use this url to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/9191345
- author
- Vesterberg, Kajsa LU
- supervisor
- Johan Frid LU
- organization
- course
- ALSK13 20242
- year
- 2025
- type
- M2 - Bachelor Degree
- subject
- keywords
- Artificial intelligence, automatic speech recognition, dialect, Scanian, Swedish, transcription
- language
- English
- id
- 9191345
- date added to LUP
- 2025-05-28 11:22:27
- date last changed
- 2025-05-28 11:22:27
@misc{9191345,
  abstract = {{The recent surge in the development of artificial intelligence has greatly improved the accuracy of automatic speech recognition. However, as with all abrupt technical improvements, it will have its weaknesses, in this case the lack of dialectal training data. This thesis investigates if AI has an unevenly distributed understanding of regional speech, focusing on the disregard of southern Swedish dialects found in the region of Scania. This is done by comparing the word error rate of spontaneous AI-generated transcriptions of standard Swedish speech to AI-generated transcription of spontaneous Scanian speech and further analyzed to find potential causes as to why it could be hard for AI to understand it. The results show that AI is significantly worse at understanding regional Scanian dialects compared to standard Swedish. This study highlights how AI is unproportionally trained using skewed data, favoring the speech of standardized language.}},
  author = {{Vesterberg, Kajsa}},
  language = {{eng}},
  note = {{Student Paper}},
  title = {{Lost in Translation: AI’s Struggles with Scanian – A Study on Language Models’ Attempts to Conquer Swedish and its Dialects}},
  year = {{2025}},
}