This article explores the applicability of large language models (LLMs) in psychometrics. We first identify and evaluate four deployment scenarios for LLMs in psychological assessment: (1) preliminary screening, (2) psychologist’s assistant, (3) autonomous psychological agent, and (4) psychological agent with expert oversight, discussing the benefits, risks, and ethical considerations of each. In the experimental part, we assess the ability of four LLMs (GPT-3.5, GPT-4, Mixtral-8x7B, and OpenChat-3.5) to identify nine cognitive emotion regulation strategies in a dataset of 515 annotated Polish-language trauma narratives. Two tasks were designed: a multiclass classification task and a binary yes/no verification task. GPT-4 achieved the best overall performance, reaching an F1 score of 0.442 in the multiclass task and 0.346 in the binary task, while also demonstrating the highest true negative rate (TNR) of 0.838. Nevertheless, all models tended to overinterpret the narratives and struggled to distinguish between conceptually similar strategies. These findings suggest that current LLMs are not yet suitable for autonomous clinical deployment and should be integrated into psychometric practice only under qualified human oversight.

• LLM roles in psychometrics: screening, assistant, autonomous agent, and expert-supervised agent.
• LLMs cannot detect from text alone how individuals manage stress through cognitive effort.
• GPT-4 and GPT-3.5 Turbo were the most accurate models in detecting strategies.
• Mixtral and OpenChat were the two most conservative models and did not overinterpret the presence of a strategy.
Mieleszczenko-Kowszewicz et al. studied this question.
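To make the reported metrics concrete, the following minimal sketch shows how an F1 score and a true negative rate (TNR) can be computed for a binary yes/no verification task. It is an illustration only: the function name `binary_metrics`, the `BinaryMetrics` container, and the toy labels are hypothetical and are not taken from the study, and the abstract does not state which averaging scheme the authors used to aggregate F1 across strategies.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class BinaryMetrics:
    precision: float
    recall: float
    f1: float
    tnr: float


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> BinaryMetrics:
    """Compute precision, recall, F1, and TNR for 0/1 labels.

    1 means "strategy present", 0 means "strategy absent".
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly detected
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # overinterpretation
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed strategy
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly rejected

    # Guard against empty denominators by falling back to 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return BinaryMetrics(precision, recall, f1, tnr)


# Hypothetical toy data: annotator labels vs. model answers for one strategy.
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(f"F1={metrics.f1:.3f}  TNR={metrics.tnr:.3f}")
```

In this framing, a model that overinterprets produces more false positives, which lowers both precision (and therefore F1) and TNR; a conservative model keeps TNR high, possibly at the cost of recall.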