Language discrepancies in the performance of generative artificial intelligence models : an examination of infectious disease queries in English and Arabic

Sallam, Malik; Al-Mahzoum, Kholoud; Alshuaib, Omaima; Alhajri, Hawajer; Alotaibi, Fatmah; Alkhurainej, Dalal; Al-Balwah, Mohammad Yahya; Barakat, Muna; Egger, Jan

Sallam, Malik (LU); Al-Mahzoum, Kholoud; Alshuaib, Omaima; Alhajri, Hawajer; Alotaibi, Fatmah; Alkhurainej, Dalal; Al-Balwah, Mohammad Yahya; Barakat, Muna and Egger, Jan (2024) In BMC Infectious Diseases 24(1).

Abstract: Background: Assessment of artificial intelligence (AI)-based models across languages is crucial to ensure equitable access and accuracy of information in multilingual contexts. This study aimed to compare AI model efficiency in English and Arabic for infectious disease queries. Methods: The study employed the METRICS checklist for the design and reporting of AI-based studies in healthcare. The AI models tested included ChatGPT-3.5, ChatGPT-4, Bing, and Bard. The queries comprised 15 questions on HIV/AIDS, tuberculosis, malaria, COVID-19, and influenza. The AI-generated content was assessed by two bilingual experts using the validated CLEAR tool. Results: In comparing AI models’ performance in English and Arabic for infectious disease queries, variability was noted. English queries showed consistently superior performance, with Bard leading, followed by Bing, ChatGPT-4, and ChatGPT-3.5 (P = .012). The same trend was observed in Arabic, albeit without statistical significance (P = .082). Stratified analysis revealed higher scores for English in most CLEAR components, notably in completeness, accuracy, appropriateness, and relevance, especially with ChatGPT-3.5 and Bard. Across the five infectious disease topics, English outperformed Arabic, except for flu queries in Bing and Bard. The four AI models’ performance in English was rated as “excellent”, significantly outperforming their “above-average” Arabic counterparts (P = .002). Conclusions: Disparity in AI model performance was noticed between English and Arabic in response to infectious disease queries. This language variation can negatively impact the quality of health content delivered by AI models among native speakers of Arabic. This issue is recommended to be addressed by AI developers, with the ultimate goal of enhancing health outcomes.

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/e4ed284c-df10-493b-902d-d1b73faafa75

author

Sallam, Malik (LU); Al-Mahzoum, Kholoud; Alshuaib, Omaima; Alhajri, Hawajer; Alotaibi, Fatmah; Alkhurainej, Dalal; Al-Balwah, Mohammad Yahya; Barakat, Muna and Egger, Jan

organization

Clinical Virology, Malmö (research group)

publishing date

2024-12

type

Contribution to journal

publication status

published

subject

Studies of Specific Languages

keywords

AI chatbots, Digital health queries, Healthcare technology, Infectious diseases, Language performance

in

BMC Infectious Diseases

volume

24

issue

1

article number

799

publisher

BioMed Central (BMC)

external identifiers

pmid:39118057
scopus:85200897290

ISSN

1471-2334

DOI

10.1186/s12879-024-09725-y

language

English

LU publication?

yes

id

e4ed284c-df10-493b-902d-d1b73faafa75

date added to LUP

2024-08-26 14:00:34

date last changed

2026-02-25 15:33:09

@article{e4ed284c-df10-493b-902d-d1b73faafa75,
  abstract     = {{Background: Assessment of artificial intelligence (AI)-based models across languages is crucial to ensure equitable access and accuracy of information in multilingual contexts. This study aimed to compare AI model efficiency in English and Arabic for infectious disease queries. Methods: The study employed the METRICS checklist for the design and reporting of AI-based studies in healthcare. The AI models tested included ChatGPT-3.5, ChatGPT-4, Bing, and Bard. The queries comprised 15 questions on HIV/AIDS, tuberculosis, malaria, COVID-19, and influenza. The AI-generated content was assessed by two bilingual experts using the validated CLEAR tool. Results: In comparing AI models’ performance in English and Arabic for infectious disease queries, variability was noted. English queries showed consistently superior performance, with Bard leading, followed by Bing, ChatGPT-4, and ChatGPT-3.5 (P = .012). The same trend was observed in Arabic, albeit without statistical significance (P = .082). Stratified analysis revealed higher scores for English in most CLEAR components, notably in completeness, accuracy, appropriateness, and relevance, especially with ChatGPT-3.5 and Bard. Across the five infectious disease topics, English outperformed Arabic, except for flu queries in Bing and Bard. The four AI models’ performance in English was rated as “excellent”, significantly outperforming their “above-average” Arabic counterparts (P = .002). Conclusions: Disparity in AI model performance was noticed between English and Arabic in response to infectious disease queries. This language variation can negatively impact the quality of health content delivered by AI models among native speakers of Arabic. This issue is recommended to be addressed by AI developers, with the ultimate goal of enhancing health outcomes.}},
  author       = {{Sallam, Malik and Al-Mahzoum, Kholoud and Alshuaib, Omaima and Alhajri, Hawajer and Alotaibi, Fatmah and Alkhurainej, Dalal and Al-Balwah, Mohammad Yahya and Barakat, Muna and Egger, Jan}},
  issn         = {{1471-2334}},
  keywords     = {{AI chatbots; Digital health queries; Healthcare technology; Infectious diseases; Language performance}},
  language     = {{eng}},
  number       = {{1}},
  publisher    = {{BioMed Central (BMC)}},
  series       = {{BMC Infectious Diseases}},
  title        = {{Language discrepancies in the performance of generative artificial intelligence models : an examination of infectious disease queries in English and Arabic}},
  url          = {{http://dx.doi.org/10.1186/s12879-024-09725-y}},
  doi          = {{10.1186/s12879-024-09725-y}},
  volume       = {{24}},
  year         = {{2024}},
}

Lund University Publications
