Depression is a common mental disorder whose symptoms are often concealed, so efficient automated recognition methods are important for early screening and clinical diagnostic support. Existing multimodal depression recognition approaches remain limited in cross-modal interaction and long-sequence semantic modeling, and they struggle to capture both local dynamics and cross-modal dependencies. To address these limitations, this study proposes a multimodal temporal fusion network (MTFNet). The approach first divides long medical interview sequences into sentence-level units based on timestamps, which mitigates the information dilution that occurs in lengthy sequences. It then applies a sentence-level dynamic multimodal attention fusion module. This module further splits each sentence sequence into contiguous segments and uses dynamic weight allocation to adaptively emphasize key modal features while suppressing redundant and noisy information. On the public DAIC-WOZ dataset and the self-built Chinese dataset MDD2025, MTFNet achieves accuracies of 86% and 84%, respectively.
Sun et al. studied this question.
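The two steps named in the abstract, timestamp-based sentence segmentation and dynamic weight allocation across modalities, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`split_by_sentences`, `chunk`, `fuse_modalities`), the half-open interval convention, and the use of a plain softmax over per-modality scores are all assumptions made for illustration. In a real model the scores would presumably be produced by a learned attention layer rather than passed in by hand.

```python
import math
from bisect import bisect_left


def split_by_sentences(frame_times, frames, sentence_spans):
    """Group frame-level features into sentence-level units.

    frame_times must be sorted; each span is a half-open interval
    [start, end) in the same time unit (an assumed convention).
    """
    units = []
    for start, end in sentence_spans:
        lo = bisect_left(frame_times, start)
        hi = bisect_left(frame_times, end)
        units.append(frames[lo:hi])
    return units


def chunk(sequence, size):
    """Split one sentence-level sequence into contiguous segments."""
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_modalities(features, scores):
    """Fuse per-modality vectors with softmax-normalized weights.

    features: one equal-length vector per modality (e.g. audio, text).
    scores:   one relevance score per modality (hypothetical input;
              a trained network would compute these).
    """
    weights = softmax(scores)
    dim = len(features[0])
    fused = [
        sum(w * vec[i] for w, vec in zip(weights, features))
        for i in range(dim)
    ]
    return fused, weights
```

For example, `split_by_sentences([0.0, 0.5, 1.0, 1.5], ["a", "b", "c", "d"], [(0.0, 1.0), (1.0, 2.0)])` groups the four frames into two sentence-level units, and `fuse_modalities` with a higher score for one modality shifts the fused vector toward that modality's features while down-weighting the others.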