Depression is a common but often concealed mental disorder, so efficient automated recognition methods are important for early screening and clinical diagnostic support. Existing multimodal depression recognition approaches remain limited in modal interaction and long-sequence semantic modeling, and they struggle to fully capture local dynamics and cross-modal dependencies. To address this, this study proposes a multimodal temporal fusion network (MTFNet). The approach first divides long medical interview sequences into sentence-level units based on timestamps to mitigate information dilution in lengthy sequences. It then applies a sentence-level dynamic multimodal attention fusion module, which further splits each sentence sequence into contiguous segments and uses dynamic weight allocation to adaptively emphasize key modal features while suppressing redundant and noisy information. On the public DAIC-WOZ dataset and the self-built Chinese dataset MDD2025, MTFNet achieves accuracies of 86% and 84%, respectively.
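The two steps described in the abstract, timestamp-based sentence segmentation followed by dynamically weighted modality fusion, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the mean pooling, the dot-product scoring, and the `scoring_weights` parameter (which would be learned in a real model) are all assumptions made for illustration, and the further split of each sentence into contiguous segments is omitted.

```python
import math


def split_by_timestamps(frames, sentences):
    """Group frame-level features into sentence-level units.

    frames: list of (time_sec, feature_vector) pairs for one modality.
    sentences: list of (start_sec, end_sec) spans, e.g. from a transcript.
    """
    return [[vec for t, vec in frames if start <= t < end]
            for start, end in sentences]


def mean_pool(vectors, dim):
    """Average a list of vectors; return zeros if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_modalities(modal_vectors, scoring_weights):
    """Fuse per-modality vectors using softmax-normalized relevance scores.

    modal_vectors: dict mapping modality name to a vector (all same length).
    scoring_weights: dict mapping modality name to a scoring vector
        (hypothetical stand-in for learned attention parameters).
    Returns the fused vector and the weight assigned to each modality.
    """
    names = sorted(modal_vectors)
    scores = [sum(w * x for w, x in zip(scoring_weights[n], modal_vectors[n]))
              for n in names]
    weights = softmax(scores)
    dim = len(modal_vectors[names[0]])
    fused = [sum(weights[k] * modal_vectors[n][i] for k, n in enumerate(names))
             for i in range(dim)]
    return fused, dict(zip(names, weights))


# Toy usage: three audio frames, two sentences, then fusion with a text vector.
audio_frames = [(0.1, [1.0, 0.0]), (0.6, [3.0, 0.0]), (1.2, [0.0, 2.0])]
sentence_spans = [(0.0, 1.0), (1.0, 2.0)]
audio_units = split_by_timestamps(audio_frames, sentence_spans)
audio_sentence = mean_pool(audio_units[0], dim=2)
fused, modality_weights = fuse_modalities(
    {"audio": audio_sentence, "text": [0.0, 2.0]},
    {"audio": [0.0, 0.0], "text": [0.0, 0.0]},  # zero scores give equal weights
)
```

Because the weights come from a softmax, a modality with a low relevance score is attenuated rather than dropped, which is one simple way to emphasize informative modalities while damping noisy ones.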
Mingyang Sun
Harbin Medical University
Shukui Ma
Taiyuan Normal University
Guangping Zhuo
Taiyuan Normal University
Frontiers in Computing and Intelligent Systems
Sun et al. studied this question.
synapsesocial.com/papers/6906a3a98b61f987b17a0110 — DOI: https://doi.org/10.54097/zs7j8602