
Lund University Publications


Natural Language Response Formats for Assessing Depression and Worry With Large Language Models: A Sequential Evaluation With Model Pre-Registration

Gu, Zhuojun; Kjell, Katarina; Schwartz, H. Andrew; and Kjell, Oscar (2025) In Assessment
Abstract
Large language models can transform individuals’ mental health descriptions into scores that correlate with rating scales approaching theoretical upper limits. However, such analyses have combined word- and text responses with little known about their differences. We develop response formats ranging from closed-ended to open-ended: (a) select words from lists, write (b) descriptive words, (c) phrases, or (d) texts. Participants answered questions about their depression/worry using the response formats and related rating scales. Language responses were transformed into word embeddings and trained to rating scales. We compare the validity (concurrent, incremental, face, discriminant, and external validity) and reliability (prospective sample and test–retest reliability) of the response formats. Using the Sequential Evaluation with Model Pre-Registration design, machine-learning models were trained on a development dataset (N = 963), and then pre-registered before tested on a prospective sample (N = 145). The pre-registered models demonstrate strong validity and reliability, yielding high accuracy in the prospective sample (r = .60–.79). Additionally, the models demonstrated external validity to self-reported sick-leave/healthcare visits, where the text-format yielded the strongest correlations (being higher/equal to rating scales for 9 of 12 cases). The overall high validity and reliability across formats suggest the possibility of choosing formats according to clinical needs.
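For context, the pipeline the abstract describes (embed free-text responses, fit a regression model to rating-scale scores, report held-out accuracy as Pearson r) can be sketched roughly as below. This is a minimal, hypothetical illustration, not the authors' code: the embedding model, the ridge regressor, and the simple train/test split (standing in for the paper's development-then-prospective-sample design) are all assumptions.

# Hypothetical sketch of the embed-then-regress pipeline from the abstract.
# The embedding model and regressor are illustrative assumptions; the paper's
# actual, pre-registered pipeline may differ.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sentence_transformers import SentenceTransformer

def score_language_responses(texts, scale_scores, seed=0):
    """Embed responses, fit ridge regression to rating-scale scores,
    and return the fitted model plus its held-out Pearson r."""
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedder
    X = embedder.encode(texts)                          # (n_responses, dim)
    y = np.asarray(scale_scores, dtype=float)
    # Simple development/evaluation split. The paper instead trained models on
    # a development dataset (N = 963), pre-registered them, and only then
    # scored them once on a later prospective sample (N = 145).
    X_dev, X_test, y_dev, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X_dev, y_dev)
    r, _ = pearsonr(model.predict(X_test), y_test)      # concurrent validity
    return model, r

Under the Sequential Evaluation with Model Pre-Registration design, the held-out evaluation above would be replaced by a genuinely prospective sample collected after the model is frozen, which guards against overfitting-driven accuracy estimates.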
Please use this URL to cite or link to this publication:
author
Gu, Zhuojun; Kjell, Katarina; Schwartz, H. Andrew and Kjell, Oscar
organization
publishing date
2025-09
type
Contribution to journal
publication status
epub
subject
keywords
artificial intelligence, large language models, natural language, natural language processing, psychological assessment, depression, anxiety
in
Assessment
publisher
SAGE Publications
external identifiers
  • scopus:105016495065
ISSN
1552-3489
DOI
10.1177/10731911251364022
language
English
LU publication?
yes
id
af84a2c2-8a84-4e8c-9463-ae7786813ead
date added to LUP
2025-10-06 11:15:10
date last changed
2025-10-14 12:53:42
@article{af84a2c2-8a84-4e8c-9463-ae7786813ead,
  abstract     = {{Large language models can transform individuals’ mental health descriptions into scores that correlate with rating scales approaching theoretical upper limits. However, such analyses have combined word- and text responses with little known about their differences. We develop response formats ranging from closed-ended to open-ended: (a) select words from lists, write (b) descriptive words, (c) phrases, or (d) texts. Participants answered questions about their depression/worry using the response formats and related rating scales. Language responses were transformed into word embeddings and trained to rating scales. We compare the validity (concurrent, incremental, face, discriminant, and external validity) and reliability (prospective sample and test–retest reliability) of the response formats. Using the Sequential Evaluation with Model Pre-Registration design, machine-learning models were trained on a development dataset (N = 963), and then pre-registered before tested on a prospective sample (N = 145). The pre-registered models demonstrate strong validity and reliability, yielding high accuracy in the prospective sample (r = .60–.79). Additionally, the models demonstrated external validity to self-reported sick-leave/healthcare visits, where the text-format yielded the strongest correlations (being higher/equal to rating scales for 9 of 12 cases). The overall high validity and reliability across formats suggest the possibility of choosing formats according to clinical needs.}},
  author       = {{Gu, Zhuojun and Kjell, Katarina and Schwartz, H. Andrew and Kjell, Oscar}},
  issn         = {{1552-3489}},
  keywords     = {{artificial intelligence; large language models; natural language; natural language processing; psychological assessment; depression; anxiety}},
  language     = {{eng}},
  month        = {{09}},
  publisher    = {{SAGE Publications}},
  series       = {{Assessment}},
  title        = {{Natural Language Response Formats for Assessing Depression and Worry With Large Language Models: A Sequential Evaluation With Model Pre-Registration}},
  url          = {{http://dx.doi.org/10.1177/10731911251364022}},
  doi          = {{10.1177/10731911251364022}},
  year         = {{2025}},
}