Testing the utility of GPT for title and abstract screening in environmental systematic evidence synthesis
(2025) In Environmental Evidence 14(1).
- Abstract
In this paper we show that OpenAI’s Large Language Model (LLM) GPT performs remarkably well when used for title and abstract eligibility screening of scientific articles within a (systematic) literature review workflow. We evaluated GPT on screening data from a systematic review study on electric vehicle charging infrastructure demand with almost 12,000 records, using the same eligibility criteria as human screeners. We tested three different versions of this model, which were tasked with distinguishing between relevant and irrelevant content by responding with a relevance probability between 0 and 1. For the latest GPT-4 model (tested in November 2023) and a probability cutoff of 0.5, the recall rate is 100%, meaning no relevant papers were missed, and using this model for screening would have saved 50% of the time that would otherwise be spent on manual screening. Experimenting with a higher cutoff threshold can save more time. With a threshold chosen so that recall is still above 95% for GPT-4 (where up to 5% of relevant papers might be missed), the model could save 75% of the time spent on manual screening. If automation technologies can replicate manual screening by human experts with comparable effectiveness, accuracy, and precision, the work and cost savings are significant. Furthermore, the value of a comprehensive list of relevant literature, available quickly at the start of a research project, is hard to overstate. However, as this study only evaluated performance on one systematic review and one prompt, we caution that more testing and methodological development are needed, and we outline the next steps to properly evaluate the rigor and effectiveness of LLMs for eligibility screening.
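The workflow described in the abstract is simple to operationalize: the model returns a relevance probability for each record, and a probability cutoff determines which records can be auto-excluded and which still go to human screeners. A minimal sketch of how recall and the share of manual screening avoided could be computed for a chosen cutoff is given below; the data structures, names, and synthetic data are assumptions for illustration, not the authors' code or results.

    # Minimal sketch (illustrative only, not the authors' implementation):
    # given per-record relevance probabilities from an LLM and the human
    # screeners' include/exclude decisions, compute recall and the fraction
    # of records the model would auto-exclude at a chosen cutoff.
    from dataclasses import dataclass
    import random

    @dataclass
    class ScreenedRecord:
        llm_probability: float  # relevance probability in [0, 1] returned by the model
        human_included: bool    # gold-standard decision from manual screening

    def evaluate_cutoff(records: list[ScreenedRecord], cutoff: float) -> tuple[float, float]:
        """Return (recall, fraction of records excluded by the model at this cutoff)."""
        relevant = [r for r in records if r.human_included]
        found = [r for r in relevant if r.llm_probability >= cutoff]
        excluded = [r for r in records if r.llm_probability < cutoff]
        recall = len(found) / len(relevant) if relevant else 1.0
        work_saved = len(excluded) / len(records) if records else 0.0
        return recall, work_saved

    if __name__ == "__main__":
        # Synthetic stand-in data (the real study used ~12,000 screened records).
        random.seed(0)
        records = []
        for _ in range(1000):
            included = random.random() < 0.05
            prob = random.betavariate(8, 2) if included else random.betavariate(2, 8)
            records.append(ScreenedRecord(prob, included))
        # Sweep cutoffs, e.g. to find the highest one that keeps recall above 95%.
        for cutoff in (0.5, 0.7, 0.9):
            recall, saved = evaluate_cutoff(records, cutoff)
            print(f"cutoff={cutoff:.1f}  recall={recall:.2%}  screening avoided={saved:.2%}")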
- author
- Nykvist, Björn (LU) ; Macura, Biljana ; Xylia, Maria and Olsson, Erik
- organization
- publishing date
- 2025-12
- type
- Contribution to journal
- publication status
- published
- subject
- keywords
- Artificial Intelligence, Large Language Model, Study selection, Systematic maps, Systematic reviews
- in
- Environmental Evidence
- volume
- 14
- issue
- 1
- article number
- 7
- publisher
- BioMed Central (BMC)
- external identifiers
- scopus:105003803065
- pmid:40270055
- ISSN
- 2047-2382
- DOI
- 10.1186/s13750-025-00360-x
- language
- English
- LU publication?
- yes
- id
- 729094c2-9cc7-4574-8bdc-49532ed332ad
- date added to LUP
- 2025-07-15 10:57:25
- date last changed
- 2025-07-16 03:00:11
@article{729094c2-9cc7-4574-8bdc-49532ed332ad,
  abstract  = {{In this paper we show that OpenAI’s Large Language Model (LLM) GPT performs remarkably well when used for title and abstract eligibility screening of scientific articles within a (systematic) literature review workflow. We evaluated GPT on screening data from a systematic review study on electric vehicle charging infrastructure demand with almost 12,000 records, using the same eligibility criteria as human screeners. We tested three different versions of this model, which were tasked with distinguishing between relevant and irrelevant content by responding with a relevance probability between 0 and 1. For the latest GPT-4 model (tested in November 2023) and a probability cutoff of 0.5, the recall rate is 100%, meaning no relevant papers were missed, and using this model for screening would have saved 50% of the time that would otherwise be spent on manual screening. Experimenting with a higher cutoff threshold can save more time. With a threshold chosen so that recall is still above 95% for GPT-4 (where up to 5% of relevant papers might be missed), the model could save 75% of the time spent on manual screening. If automation technologies can replicate manual screening by human experts with comparable effectiveness, accuracy, and precision, the work and cost savings are significant. Furthermore, the value of a comprehensive list of relevant literature, available quickly at the start of a research project, is hard to overstate. However, as this study only evaluated performance on one systematic review and one prompt, we caution that more testing and methodological development are needed, and we outline the next steps to properly evaluate the rigor and effectiveness of LLMs for eligibility screening.}},
  author    = {{Nykvist, Björn and Macura, Biljana and Xylia, Maria and Olsson, Erik}},
  issn      = {{2047-2382}},
  keywords  = {{Artificial Intelligence; Large Language Model; Study selection; Systematic maps; Systematic reviews}},
  language  = {{eng}},
  number    = {{1}},
  publisher = {{BioMed Central (BMC)}},
  series    = {{Environmental Evidence}},
  title     = {{Testing the utility of GPT for title and abstract screening in environmental systematic evidence synthesis}},
  url       = {{http://dx.doi.org/10.1186/s13750-025-00360-x}},
  doi       = {{10.1186/s13750-025-00360-x}},
  volume    = {{14}},
  year      = {{2025}},
}