Natural Language Response Formats for Assessing Depression and Worry With Large Language Models: A Sequential Evaluation With Model Pre-Registration
(2025) In Assessment
- Abstract
- Large language models can transform individuals’ mental health descriptions into scores that correlate with rating scales approaching theoretical upper limits. However, such analyses have combined word- and text responses with little known about their differences. We develop response formats ranging from closed-ended to open-ended: (a) select words from lists, write (b) descriptive words, (c) phrases, or (d) texts. Participants answered questions about their depression/worry using the response formats and related rating scales. Language responses were transformed into word embeddings and trained to rating scales. We compare the validity (concurrent, incremental, face, discriminant, and external validity) and reliability (prospective sample and test–retest reliability) of the response formats. Using the Sequential Evaluation with Model Pre-Registration design, machine-learning models were trained on a development dataset (N = 963), and then pre-registered before tested on a prospective sample (N = 145). The pre-registered models demonstrate strong validity and reliability, yielding high accuracy in the prospective sample (r = .60–.79). Additionally, the models demonstrated external validity to self-reported sick-leave/healthcare visits, where the text-format yielded the strongest correlations (being higher/equal to rating scales for 9 of 12 cases). The overall high validity and reliability across formats suggest the possibility of choosing formats according to clinical needs.
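The pipeline summarized in the abstract (free-text responses encoded as embeddings, a regression model trained to predict rating-scale scores, and accuracy evaluated as Pearson's r on a held-out sample) can be sketched roughly as below. This is a minimal illustration, not the authors' pre-registered models: the encoder (all-MiniLM-L6-v2), the ridge regressor, and the toy data are all assumptions.

# Minimal sketch of the embedding-to-rating-scale pipeline from the abstract.
# The encoder, the ridge regressor, and the toy data below are illustrative
# assumptions; the paper's pre-registered models are not reproduced here.
from scipy.stats import pearsonr
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Hypothetical language responses and matched rating-scale totals.
texts = [
    "I feel hopeless and tired most days",
    "I rarely worry about anything at all",
    "Everything feels heavy and pointless lately",
    "I am mostly calm and content",
    "My mind races with worries every night",
    "Some days are hard, but I manage fine",
]
scale_scores = [19.0, 2.0, 21.0, 4.0, 16.0, 8.0]

# 1. Transform the language responses into embeddings.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(texts)

# 2. Train a regression model mapping embeddings to rating-scale scores on a
#    development split; the held-out split stands in for the prospective sample.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, scale_scores, test_size=0.5, random_state=0
)
model = RidgeCV(alphas=[0.1, 1.0, 10.0, 100.0]).fit(X_dev, y_dev)

# 3. Report accuracy as the Pearson correlation between predicted and observed
#    scores, the metric the abstract reports (r = .60-.79 in the real study).
r, _ = pearsonr(model.predict(X_test), y_test)
print(f"held-out accuracy: r = {r:.2f}")

Any regressor could stand in for RidgeCV here; the abstract specifies only that embeddings were trained to rating scales, so the estimator choice is a placeholder.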
Please use this URL to cite or link to this publication:
https://lup.lub.lu.se/record/af84a2c2-8a84-4e8c-9463-ae7786813ead
- author
- Gu, Zhuojun (LU); Kjell, Katarina (LU); Schwartz, H. Andrew and Kjell, Oscar (LU)
- organization
- publishing date
- 2025-09-20
- type
- Contribution to journal
- publication status
- epub
- subject
- keywords
- artificial intelligence, large language models, natural language, natural language processing, psychological assessment, depression, anxiety
- in
- Assessment
- publisher
- SAGE Publications
- external identifiers
- scopus:105016495065
- ISSN
- 1552-3489
- DOI
- 10.1177/10731911251364022
- language
- English
- LU publication?
- yes
- id
- af84a2c2-8a84-4e8c-9463-ae7786813ead
- date added to LUP
- 2025-10-06 11:15:10
- date last changed
- 2025-10-14 12:53:42
@article{af84a2c2-8a84-4e8c-9463-ae7786813ead,
abstract = {{Large language models can transform individuals’ mental health descriptions into scores that correlate with rating scales approaching theoretical upper limits. However, such analyses have combined word- and text responses with little known about their differences. We develop response formats ranging from closed-ended to open-ended: (a) select words from lists, write (b) descriptive words, (c) phrases, or (d) texts. Participants answered questions about their depression/worry using the response formats and related rating scales. Language responses were transformed into word embeddings and trained to rating scales. We compare the validity (concurrent, incremental, face, discriminant, and external validity) and reliability (prospective sample and test–retest reliability) of the response formats. Using the Sequential Evaluation with Model Pre-Registration design, machine-learning models were trained on a development dataset (N = 963), and then pre-registered before tested on a prospective sample (N = 145). The pre-registered models demonstrate strong validity and reliability, yielding high accuracy in the prospective sample (r = .60–.79). Additionally, the models demonstrated external validity to self-reported sick-leave/healthcare visits, where the text-format yielded the strongest correlations (being higher/equal to rating scales for 9 of 12 cases). The overall high validity and reliability across formats suggest the possibility of choosing formats according to clinical needs.}},
author = {{Gu, Zhuojun and Kjell, Katarina and Schwartz, H. Andrew and Kjell, Oscar}},
issn = {{1552-3489}},
keywords = {{artificial intelligence; large language models; natural language; natural language processing; psychological assessment; depression; anxiety}},
language = {{eng}},
month = {{09}},
publisher = {{SAGE Publications}},
series = {{Assessment}},
title = {{Natural Language Response Formats for Assessing Depression and Worry With Large Language Models: A Sequential Evaluation With Model Pre-Registration}},
url = {{http://dx.doi.org/10.1177/10731911251364022}},
doi = {{10.1177/10731911251364022}},
year = {{2025}},
}