
Testing the utility of GPT for title and abstract screening in environmental systematic evidence synthesis

Nykvist, Björn; Macura, Biljana; Xylia, Maria and Olsson, Erik (2025) In Environmental Evidence 14(1).
Abstract


In this paper we show that OpenAI’s Large Language Model (LLM) GPT performs remarkably well when used for title and abstract eligibility screening of scientific articles within a (systematic) literature review workflow. We evaluated GPT on screening data from a systematic review study on electric vehicle charging infrastructure demand with almost 12,000 records, using the same eligibility criteria as the human screeners. We tested three different versions of the model, each tasked with distinguishing between relevant and irrelevant content by responding with a relevance probability between 0 and 1. For the latest GPT-4 model (tested in November 2023) and a probability cutoff of 0.5, the recall rate is 100%, meaning no relevant papers were missed, and using this model for screening would have saved 50% of the time that would otherwise be spent on manual screening. A higher cutoff threshold can save more time: with the threshold chosen so that recall for GPT-4 stays above 95% (i.e., up to 5% of relevant papers might be missed), the model could save 75% of the time spent on manual screening. If automation technologies can replicate manual screening by human experts with comparable effectiveness, accuracy, and precision, the work and cost savings are significant. Furthermore, the value of a comprehensive list of relevant literature, available quickly at the start of a research project, is hard to overstate. However, as this study only evaluated performance on one systematic review and one prompt, we caution that more testing and methodological development are needed, and we outline the next steps to properly evaluate the rigor and effectiveness of LLMs for eligibility screening.
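The workflow the abstract describes (prompting the model for a relevance probability per record, then applying a cutoff and measuring recall and work saved against the human screening decisions) can be sketched as below. This is a minimal illustration, not the authors' published code: the prompt wording, the gpt-4 model identifier, and the relevance_probability and evaluate helpers are assumptions for the sketch; the study's exact prompt and evaluation pipeline are described in the paper itself.

# Minimal sketch of probability-based title/abstract screening,
# assuming the OpenAI Python SDK (openai>=1.0) and an API key in
# the OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

ELIGIBILITY_CRITERIA = "..."  # the review's inclusion/exclusion criteria

def relevance_probability(title: str, abstract: str) -> float:
    """Ask the model for a 0-1 relevance probability for one record."""
    response = client.chat.completions.create(
        model="gpt-4",  # the paper tested three GPT versions; one shown here
        messages=[
            {"role": "system",
             "content": "You screen records for a systematic review. "
                        "Reply with only a probability between 0 and 1 "
                        "that the record meets these eligibility criteria:\n"
                        + ELIGIBILITY_CRITERIA},
            {"role": "user",
             "content": f"Title: {title}\nAbstract: {abstract}"},
        ],
        temperature=0,
    )
    # Assumes the model replies with a bare number, as instructed.
    return float(response.choices[0].message.content.strip())

def evaluate(records, human_labels, cutoff=0.5):
    """Recall and screening work saved at a given probability cutoff.

    records: list of (title, abstract) tuples; human_labels: list of
    bools from the manual screeners (the gold standard in the study).
    """
    probs = [relevance_probability(t, a) for t, a in records]
    predicted = [p >= cutoff for p in probs]
    relevant = sum(human_labels)
    recalled = sum(1 for p, gold in zip(predicted, human_labels) if p and gold)
    recall = recalled / relevant if relevant else 1.0
    # Records the model excludes would need no manual screening at all,
    # which is the source of the time savings reported in the abstract.
    work_saved = predicted.count(False) / len(records)
    return recall, work_saved

Raising the cutoff excludes more records (more work saved) at the risk of missing relevant papers (lower recall), which is the trade-off behind the reported 100% recall at cutoff 0.5 versus 75% work saved at a higher threshold.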

author
Nykvist, Björn; Macura, Biljana; Xylia, Maria and Olsson, Erik
organization
publishing date
2025
type
Contribution to journal
publication status
published
subject
keywords
Artificial Intelligence, Large Language Model, Study selection, Systematic maps, Systematic reviews
in
Environmental Evidence
volume
14
issue
1
article number
7
publisher
BioMed Central (BMC)
external identifiers
  • scopus:105003803065
  • pmid:40270055
ISSN
2047-2382
DOI
10.1186/s13750-025-00360-x
language
English
LU publication?
yes
id
729094c2-9cc7-4574-8bdc-49532ed332ad
date added to LUP
2025-07-15 10:57:25
date last changed
2025-07-16 03:00:11
@article{729094c2-9cc7-4574-8bdc-49532ed332ad,
  abstract     = {{In this paper we show that OpenAI’s Large Language Model (LLM) GPT performs remarkably well when used for title and abstract eligibility screening of scientific articles within a (systematic) literature review workflow. We evaluated GPT on screening data from a systematic review study on electric vehicle charging infrastructure demand with almost 12,000 records, using the same eligibility criteria as the human screeners. We tested three different versions of the model, each tasked with distinguishing between relevant and irrelevant content by responding with a relevance probability between 0 and 1. For the latest GPT-4 model (tested in November 2023) and a probability cutoff of 0.5, the recall rate is 100%, meaning no relevant papers were missed, and using this model for screening would have saved 50% of the time that would otherwise be spent on manual screening. A higher cutoff threshold can save more time: with the threshold chosen so that recall for GPT-4 stays above 95% (i.e., up to 5% of relevant papers might be missed), the model could save 75% of the time spent on manual screening. If automation technologies can replicate manual screening by human experts with comparable effectiveness, accuracy, and precision, the work and cost savings are significant. Furthermore, the value of a comprehensive list of relevant literature, available quickly at the start of a research project, is hard to overstate. However, as this study only evaluated performance on one systematic review and one prompt, we caution that more testing and methodological development are needed, and we outline the next steps to properly evaluate the rigor and effectiveness of LLMs for eligibility screening.}},
  author       = {{Nykvist, Björn and Macura, Biljana and Xylia, Maria and Olsson, Erik}},
  issn         = {{2047-2382}},
  keywords     = {{Artificial Intelligence; Large Language Model; Study selection; Systematic maps; Systematic reviews}},
  language     = {{eng}},
  number       = {{1}},
  publisher    = {{BioMed Central (BMC)}},
  series       = {{Environmental Evidence}},
  title        = {{Testing the utility of GPT for title and abstract screening in environmental systematic evidence synthesis}},
  url          = {{http://dx.doi.org/10.1186/s13750-025-00360-x}},
  doi          = {{10.1186/s13750-025-00360-x}},
  volume       = {{14}},
  year         = {{2025}},
}