Testing the utility of GPT for title and abstract screening in environmental systematic evidence synthesis
(2025) In Environmental Evidence 14(1).
- Abstract
In this paper we show that OpenAI’s Large Language Model (LLM) GPT performs remarkably well when used for title and abstract eligibility screening of scientific articles within a (systematic) literature review workflow. We evaluated GPT on screening data from a systematic review study on electric vehicle charging infrastructure demand with almost 12,000 records, using the same eligibility criteria as human screeners. We tested three different versions of this model, which were tasked with distinguishing between relevant and irrelevant content by responding with a relevance probability between 0 and 1. For the latest GPT-4 model (tested in November 2023) and a probability cutoff of 0.5, the recall rate is 100%, meaning no relevant papers were missed, and using this model for screening would have saved 50% of the time that would otherwise be spent on manual screening. Experimenting with a higher cutoff threshold can save more time. With a threshold chosen so that recall is still above 95% for GPT-4 (where up to 5% of relevant papers might be missed), the model could save 75% of the time spent on manual screening. If automation technologies can replicate manual screening by human experts with comparable effectiveness, accuracy, and precision, the work and cost savings are significant. Furthermore, the value of a comprehensive list of relevant literature, available quickly at the start of a research project, is hard to overstate. However, as this study only evaluated performance on one systematic review and one prompt, we caution that more testing and methodological development are needed, and we outline the next steps to properly evaluate the rigor and effectiveness of LLMs for eligibility screening.
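The workflow described in the abstract is simple to operationalize: the model returns a relevance probability for each record, and a probability cutoff determines which records can be auto-excluded and which still go to human screeners. A minimal sketch of how recall and the share of manual screening avoided could be computed for a chosen cutoff is given below; the data structures, names, and synthetic data are assumptions for illustration, not the authors' code or results.

    # Minimal sketch (illustrative only, not the authors' implementation):
    # given per-record relevance probabilities from an LLM and the human
    # screeners' include/exclude decisions, compute recall and the fraction
    # of records the model would auto-exclude at a chosen cutoff.
    from dataclasses import dataclass
    import random

    @dataclass
    class ScreenedRecord:
        llm_probability: float  # relevance probability in [0, 1] returned by the model
        human_included: bool    # gold-standard decision from manual screening

    def evaluate_cutoff(records: list[ScreenedRecord], cutoff: float) -> tuple[float, float]:
        """Return (recall, fraction of records excluded by the model at this cutoff)."""
        relevant = [r for r in records if r.human_included]
        found = [r for r in relevant if r.llm_probability >= cutoff]
        excluded = [r for r in records if r.llm_probability < cutoff]
        recall = len(found) / len(relevant) if relevant else 1.0
        work_saved = len(excluded) / len(records) if records else 0.0
        return recall, work_saved

    if __name__ == "__main__":
        # Synthetic stand-in data (the real study used ~12,000 screened records).
        random.seed(0)
        records = []
        for _ in range(1000):
            included = random.random() < 0.05
            prob = random.betavariate(8, 2) if included else random.betavariate(2, 8)
            records.append(ScreenedRecord(prob, included))
        # Sweep cutoffs, e.g. to find the highest one that keeps recall above 95%.
        for cutoff in (0.5, 0.7, 0.9):
            recall, saved = evaluate_cutoff(records, cutoff)
            print(f"cutoff={cutoff:.1f}  recall={recall:.2%}  screening avoided={saved:.2%}")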
- author
- Nykvist, Björn (LU) ; Macura, Biljana ; Xylia, Maria and Olsson, Erik
- organization
- publishing date
- 2025-12
- type
- Contribution to journal
- publication status
- published
- subject
- keywords
- Artificial Intelligence, Large Language Model, Study selection, Systematic maps, Systematic reviews
- in
- Environmental Evidence
- volume
- 14
- issue
- 1
- article number
- 7
- publisher
- BioMed Central (BMC)
- external identifiers
- scopus:105003803065
- pmid:40270055
- ISSN
- 2047-2382
- DOI
- 10.1186/s13750-025-00360-x
- language
- English
- LU publication?
- yes
- id
- 729094c2-9cc7-4574-8bdc-49532ed332ad
- date added to LUP
- 2025-07-15 10:57:25
- date last changed
- 2025-07-16 03:00:11
@article{729094c2-9cc7-4574-8bdc-49532ed332ad,
  abstract  = {{In this paper we show that OpenAI’s Large Language Model (LLM) GPT performs remarkably well when used for title and abstract eligibility screening of scientific articles within a (systematic) literature review workflow. We evaluated GPT on screening data from a systematic review study on electric vehicle charging infrastructure demand with almost 12,000 records, using the same eligibility criteria as human screeners. We tested three different versions of this model, which were tasked with distinguishing between relevant and irrelevant content by responding with a relevance probability between 0 and 1. For the latest GPT-4 model (tested in November 2023) and a probability cutoff of 0.5, the recall rate is 100%, meaning no relevant papers were missed, and using this model for screening would have saved 50% of the time that would otherwise be spent on manual screening. Experimenting with a higher cutoff threshold can save more time. With a threshold chosen so that recall is still above 95% for GPT-4 (where up to 5% of relevant papers might be missed), the model could save 75% of the time spent on manual screening. If automation technologies can replicate manual screening by human experts with comparable effectiveness, accuracy, and precision, the work and cost savings are significant. Furthermore, the value of a comprehensive list of relevant literature, available quickly at the start of a research project, is hard to overstate. However, as this study only evaluated performance on one systematic review and one prompt, we caution that more testing and methodological development are needed, and we outline the next steps to properly evaluate the rigor and effectiveness of LLMs for eligibility screening.}},
  author    = {{Nykvist, Björn and Macura, Biljana and Xylia, Maria and Olsson, Erik}},
  issn      = {{2047-2382}},
  keywords  = {{Artificial Intelligence; Large Language Model; Study selection; Systematic maps; Systematic reviews}},
  language  = {{eng}},
  number    = {{1}},
  publisher = {{BioMed Central (BMC)}},
  series    = {{Environmental Evidence}},
  title     = {{Testing the utility of GPT for title and abstract screening in environmental systematic evidence synthesis}},
  url       = {{http://dx.doi.org/10.1186/s13750-025-00360-x}},
  doi       = {{10.1186/s13750-025-00360-x}},
  volume    = {{14}},
  year      = {{2025}},
}