Natural Language Response Formats for Assessing Depression and Worry With Large Language Models: A Sequential Evaluation With Model Pre-Registration
(2025) In Assessment
- Abstract
- Large language models can transform individuals’ mental health descriptions into scores that correlate with rating scales approaching theoretical upper limits. However, such analyses have combined word- and text responses with little known about their differences. We develop response formats ranging from closed-ended to open-ended: (a) select words from lists, write (b) descriptive words, (c) phrases, or (d) texts. Participants answered questions about their depression/worry using the response formats and related rating scales. Language responses were transformed into word embeddings and trained to rating scales. We compare the validity (concurrent, incremental, face, discriminant, and external validity) and reliability (prospective sample and test–retest reliability) of the response formats. Using the Sequential Evaluation with Model Pre-Registration design, machine-learning models were trained on a development dataset (N = 963), and then pre-registered before tested on a prospective sample (N = 145). The pre-registered models demonstrate strong validity and reliability, yielding high accuracy in the prospective sample (r = .60–.79). Additionally, the models demonstrated external validity to self-reported sick-leave/healthcare visits, where the text-format yielded the strongest correlations (being higher/equal to rating scales for 9 of 12 cases). The overall high validity and reliability across formats suggest the possibility of choosing formats according to clinical needs.
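The pipeline summarized in the abstract (free-text responses encoded as embeddings, a regression model trained to predict rating-scale scores, and accuracy evaluated as Pearson's r on a held-out sample) can be sketched roughly as below. This is a minimal illustration, not the authors' pre-registered models: the encoder (all-MiniLM-L6-v2), the ridge regressor, and the toy data are all assumptions.

# Minimal sketch of the embedding-to-rating-scale pipeline from the abstract.
# The encoder, the ridge regressor, and the toy data below are illustrative
# assumptions; the paper's pre-registered models are not reproduced here.
from scipy.stats import pearsonr
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Hypothetical language responses and matched rating-scale totals.
texts = [
    "I feel hopeless and tired most days",
    "I rarely worry about anything at all",
    "Everything feels heavy and pointless lately",
    "I am mostly calm and content",
    "My mind races with worries every night",
    "Some days are hard, but I manage fine",
]
scale_scores = [19.0, 2.0, 21.0, 4.0, 16.0, 8.0]

# 1. Transform the language responses into embeddings.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(texts)

# 2. Train a regression model mapping embeddings to rating-scale scores on a
#    development split; the held-out split stands in for the prospective sample.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, scale_scores, test_size=0.5, random_state=0
)
model = RidgeCV(alphas=[0.1, 1.0, 10.0, 100.0]).fit(X_dev, y_dev)

# 3. Report accuracy as the Pearson correlation between predicted and observed
#    scores, the metric the abstract reports (r = .60-.79 in the real study).
r, _ = pearsonr(model.predict(X_test), y_test)
print(f"held-out accuracy: r = {r:.2f}")

Any regressor could stand in for RidgeCV here; the abstract specifies only that embeddings were trained to rating scales, so the estimator choice is a placeholder.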
Please use this URL to cite or link to this publication:
https://lup.lub.lu.se/record/af84a2c2-8a84-4e8c-9463-ae7786813ead
- author
- Gu, Zhuojun (LU); Kjell, Katarina (LU); Schwartz, H. Andrew and Kjell, Oscar (LU)
- organization
- publishing date
- 2025-09-20
- type
- Contribution to journal
- publication status
- epub
- subject
- keywords
- artificial intelligence, large language models, natural language, natural language processing, psychological assessment, depression, anxiety
- in
- Assessment
- publisher
- SAGE Publications
- external identifiers
- scopus:105016495065
- ISSN
- 1552-3489
- DOI
- 10.1177/10731911251364022
- language
- English
- LU publication?
- yes
- id
- af84a2c2-8a84-4e8c-9463-ae7786813ead
- date added to LUP
- 2025-10-06 11:15:10
- date last changed
- 2025-10-14 12:53:42
@article{af84a2c2-8a84-4e8c-9463-ae7786813ead,
abstract = {{Large language models can transform individuals’ mental health descriptions into scores that correlate with rating scales approaching theoretical upper limits. However, such analyses have combined word- and text responses with little known about their differences. We develop response formats ranging from closed-ended to open-ended: (a) select words from lists, write (b) descriptive words, (c) phrases, or (d) texts. Participants answered questions about their depression/worry using the response formats and related rating scales. Language responses were transformed into word embeddings and trained to rating scales. We compare the validity (concurrent, incremental, face, discriminant, and external validity) and reliability (prospective sample and test–retest reliability) of the response formats. Using the Sequential Evaluation with Model Pre-Registration design, machine-learning models were trained on a development dataset (N = 963), and then pre-registered before tested on a prospective sample (N = 145). The pre-registered models demonstrate strong validity and reliability, yielding high accuracy in the prospective sample (r = .60–.79). Additionally, the models demonstrated external validity to self-reported sick-leave/healthcare visits, where the text-format yielded the strongest correlations (being higher/equal to rating scales for 9 of 12 cases). The overall high validity and reliability across formats suggest the possibility of choosing formats according to clinical needs.}},
author = {{Gu, Zhuojun and Kjell, Katarina and Schwartz, H. Andrew and Kjell, Oscar}},
issn = {{1552-3489}},
keywords = {{artificial intelligence; large language models; natural language; natural language processing; psychological assessment; depression; anxiety}},
language = {{eng}},
month = {{09}},
publisher = {{SAGE Publications}},
series = {{Assessment}},
title = {{Natural Language Response Formats for Assessing Depression and Worry With Large Language Models: A Sequential Evaluation With Model Pre-Registration}},
url = {{http://dx.doi.org/10.1177/10731911251364022}},
doi = {{10.1177/10731911251364022}},
year = {{2025}},
}