What type of study is this?

This is a quantitative study.

October 13, 2025 | Open Access

Combating Hallucinations in Large Language Models: A Multi-Scale Dialogue Semantic Modeling Framework for Automated Depression Risk Assessment

Key points

The proposed framework targets depression risk assessment by reducing the effect of hallucinations in large language models.
Evaluation on the DAIC-WOZ dataset showed a significant correlation with actual PHQ-8 scores on the development set (Pearson r = 0.749, p < 0.05).
Mean absolute error was 3.1 on the development set and 4.12 on the test set.
Multi-scale semantic modeling captures both local context (a single dialogue turn) and broader context (two consecutive turns) in dialogue data.
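
The reported figures rely on three standard metrics: Pearson correlation, mean absolute error (MAE), and root mean square error (RMSE). As a minimal sketch of how such metrics are computed from predicted and actual PHQ-8 scores, the following uses plain Python. The function names and the toy scores are assumptions made for illustration; they are not taken from the paper or from DAIC-WOZ.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error between actual and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error; penalizes large misses more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two score lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


# Toy PHQ-8-style scores (invented for illustration, range 0 to 24).
actual = [2, 10, 17, 5]
predicted = [4, 8, 15, 6]

print(mae(actual, predicted))   # 1.75
print(rmse(actual, predicted))  # sqrt(3.25), about 1.80
print(pearson_r(actual, predicted))
```

Because PHQ-8 totals range from 0 to 24, an MAE of about 3 corresponds to an average miss of roughly three points on that scale.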

Abstract

Current large language model (LLM) approaches for depression detection generate response vectors from prompts, but they often work with transcribed text that is informationally incomplete and semantically ambiguous. This frequently results in responses that seem superficially plausible yet are factually incorrect due to hallucinated reasoning. As a consequence, response vectors become contaminated with spurious information, compromising the reliability of detection outcomes. To address hallucination in LLMs, particularly in contexts with scarce conversational history or pronounced semantic ambiguity, this paper introduces a novel multi-scale semantic modeling algorithm based on question-answering (Q&A) dialogues. The proposed method aims to support fully automated processing of dialogue data for individual depression risk prediction. The algorithm constructs semantic representations at multiple scales from Q&A dialogue data. The first scale captures local semantics within a single dialogue turn, while subsequent scales incorporate context across two consecutive turns to model broader discourse information. Integrated with a tailored neural network architecture, the framework extracts semantic features indicative of depression risk. The methodology was evaluated experimentally on the DAIC-WOZ dataset. On the development set, the screening outcomes of the depression risk assessment algorithm correlated strongly with actual PHQ-8 scores (Pearson r = 0.749, p < 0.05). In terms of predictive accuracy, the development set achieved a mean absolute error (MAE) of 3.1 and a root mean square error (RMSE) of 4.1, while the test set obtained an MAE of 4.12 and an RMSE of 4.79.
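
To make the multi-scale idea concrete, the sketch below groups a Q&A dialogue into single-turn segments (scale 1) and segments spanning two consecutive turns (scale 2). It is only an illustration under stated assumptions: the `build_scales` helper, the `"Q: ... A: ..."` rendering, and the sample dialogue are hypothetical and are not the authors' implementation, which also includes a neural network that is not reproduced here.

```python
from typing import List, Tuple

# Hypothetical representation of one dialogue turn: (question, answer).
Turn = Tuple[str, str]


def build_scales(turns: List[Turn], max_window: int = 2) -> List[List[str]]:
    """Return one list of text segments per scale.

    Scale 1 keeps each Q&A turn on its own; scale 2 joins every pair of
    consecutive turns so that a segment carries broader discourse context.
    """
    rendered = [f"Q: {q} A: {a}" for q, a in turns]
    scales = []
    for window in range(1, max_window + 1):
        # Slide a window of `window` turns over the dialogue.
        segments = [
            " ".join(rendered[i:i + window])
            for i in range(len(rendered) - window + 1)
        ]
        scales.append(segments)
    return scales


# Invented sample dialogue, not taken from DAIC-WOZ.
dialogue = [
    ("How are you doing today?", "I'm okay, I guess."),
    ("How have you been sleeping?", "Not well lately."),
    ("What do you do to relax?", "I used to read."),
]

scales = build_scales(dialogue)
print(len(scales[0]), len(scales[1]))  # 3 2
print(scales[1][0])
```

Each segment could then be passed to a text encoder, with the per-scale features combined by a downstream model. That step is left out because the abstract does not specify the architecture.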
