What question did this study set out to answer?

The aim is to explore the applicability of large language models (LLMs) in psychometrics and assess their performance in identifying cognitive emotion regulation strategies.

April 22, 2026. Open Access

Exploring the future of psychometrics from a Large Language Model perspective: A case study analysis

Key Points

Identified four deployment scenarios for LLMs in psychological assessment.
Conducted a multiclass classification and binary verification task using four LLMs on a dataset of Polish-language trauma narratives.
Evaluated models based on their ability to detect cognitive emotion regulation strategies.
GPT-4 achieved the highest F1 score (0.442) in the multiclass task and the highest true negative rate (TNR) of 0.838.
All models showed a tendency to overinterpret data and struggled to differentiate similar strategies.
Conclusions indicate LLMs are not yet suitable for autonomous clinical use without human oversight.

Abstract

This article explores the applicability of LLMs in psychometrics. We first identify and evaluate four deployment scenarios for LLMs in psychological assessment: (1) preliminary screening, (2) psychologist’s assistant, (3) autonomous psychological agent, and (4) psychological agent with expert oversight, discussing their respective benefits, risks, and ethical considerations. In the experimental part, we assess the ability of four LLMs (GPT-3.5, GPT-4, Mixtral-8x7B, and OpenChat-3.5) to identify nine cognitive emotion regulation strategies in a dataset of 515 annotated Polish-language trauma narratives. Two tasks were designed: a multiclass classification task and a binary yes/no verification task. GPT-4 achieved the best overall performance, reaching an F1 score of 0.442 in the multiclass task and 0.346 in the binary task, while also demonstrating the highest true negative rate (TNR) of 0.838. Nevertheless, all models exhibited a tendency towards overinterpretation and struggled to distinguish between conceptually similar strategies. These findings suggest that current LLMs are not yet suitable for autonomous clinical deployment and should be integrated into psychometric practice only under qualified human oversight.

• LLM roles in psychometrics: screening, assistant, autonomous agent, and expert-supervised agent.
• LLMs cannot detect how individuals manage stress via cognitive effort from text alone.
• GPT-4 and GPT-3.5 Turbo were the most accurate models in detecting strategies.
• Mixtral and OpenChat were the two most conservative models that did not overinterpret the presence of a strategy.
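
The F1 and TNR figures reported in the abstract can be read more easily with the definitions in front of you. The following is a minimal sketch of how both metrics are commonly computed for a binary yes/no verification task. The function names and the example labels are hypothetical and are not taken from the study, and the sketch does not claim to reproduce the authors' evaluation procedure (for instance, how scores were averaged across the nine strategies).

```python
from typing import Sequence, Tuple


def confusion_counts(gold: Sequence[bool], pred: Sequence[bool]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary labels, where True means 'strategy present'."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    tn = sum(1 for g, p in zip(gold, pred) if not g and not p)
    return tp, fp, fn, tn


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def true_negative_rate(tn: int, fp: int) -> float:
    """Share of truly absent strategies that the model correctly answered 'no' to."""
    if tn + fp == 0:
        return 0.0
    return tn / (tn + fp)


# Hypothetical annotations (gold) and model answers (pred) for eight narratives.
gold = [True, False, False, True, False, False, False, True]
pred = [True, True, False, False, False, False, True, True]

tp, fp, fn, tn = confusion_counts(gold, pred)
f1 = f1_score(tp, fp, fn)
tnr = true_negative_rate(tn, fp)
print(f"F1 = {f1:.3f}, TNR = {tnr:.3f}")
```

Under these definitions, a model that overinterprets (answers "yes" when the annotators marked the strategy as absent) accumulates false positives, which lowers both its precision and its TNR.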

Cite This Study

Mieleszczenko-Kowszewicz et al. studied this question.

synapsesocial.com/papers/69e864c46e0dea528dde97f4
https://doi.org/10.1016/j.chbr.2026.101060