August 18, 2025 | Open Access

Evaluating Psychological Competency via Chinese Q&A in Large Language Models

Key Points

LLMs answered closed-ended questions more accurately than open-ended ones, indicating that question format affects performance.
The evaluation covered dense, Mixture-of-Experts, and reasoning LLMs of varying parameter sizes on Chinese psychological QA.
Error analysis indicated that incorrect or hallucinated responses stem primarily from insufficient psychological knowledge and conceptual confusion.
Despite these advances, users should be cautious about over-relying on LLM outputs; a complementary, human-AI collaborative approach is recommended for practical use.

Abstract

The application of large language models (LLMs) in psychology has recently gained increasing attention. However, their psychological competence still requires further investigation. This study explores this issue through the lens of Chinese psychological knowledge question answering (QA). Specifically, we constructed a dedicated dataset based on Chinese qualification examinations for psychological counselors and psychotherapists. We then evaluated dense, Mixture-of-Experts, and reasoning LLMs with varying parameter sizes and evaluation modes in the Chinese context, measuring answer accuracy in both closed-ended and open-ended settings. The experimental results showed that larger and more recent LLMs achieved higher accuracy in psychological QA. While few-shot learning improved accuracy, Chain-of-Thought prompting and reasoning LLMs provided only limited gains. Notably, LLMs achieved higher accuracy in closed-ended settings than in open-ended ones. Furthermore, error analysis indicated that LLMs can produce incorrect or hallucinated responses, primarily due to insufficient psychological knowledge and conceptual confusion. Although current LLMs show promise in psychological QA tasks, users should remain cautious about over-reliance on their responses. A complementary, human-AI collaborative approach is recommended for practical use.
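
As an illustration of the closed-ended accuracy measure mentioned in the abstract, the following is a minimal sketch, not the study's actual scoring code. It assumes each model response has already been reduced to its option letters (for example "B" or "b, D" for a multiple-answer item); the function names `normalize_choice` and `closed_ended_accuracy` are hypothetical, and the exact-match rule for multiple-answer items is likewise an assumption.

```python
from typing import Sequence


def normalize_choice(answer: str) -> str:
    """Canonicalize an option-letter answer, e.g. ' b, D ' becomes 'BD'.

    Assumption: the input contains only option letters plus optional
    commas and spaces, not free-form text.
    """
    cleaned = answer.upper().replace(",", "").replace(" ", "")
    # Sort and deduplicate so that 'DB' and 'BD' compare as equal.
    return "".join(sorted(set(cleaned)))


def closed_ended_accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of questions whose predicted options exactly match the key."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("at least one question is required")
    correct = sum(
        normalize_choice(p) == normalize_choice(g)
        for p, g in zip(predictions, gold)
    )
    return correct / len(gold)


if __name__ == "__main__":
    # Hypothetical toy data: three questions, two answered correctly.
    preds = ["A", "b,d", "C"]
    keys = ["A", "DB", "D"]
    print(f"accuracy = {closed_ended_accuracy(preds, keys):.3f}")
```

Open-ended responses cannot be scored by exact match in this way; they would need a separate procedure, such as a grading rubric, which this sketch does not attempt to cover.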
