Multimodal emotion recognition in conversational contexts has attracted increasing attention due to its ability to analyze human emotions by jointly modeling visual, textual, and audio cues in dynamic interactions. However, existing methods struggle to capture hierarchical relationships among emotional cues, are highly sensitive to noisy or missing data, and often fail to model fine-grained emotional transitions over time. These limitations hinder the interpretation of subtle emotional variations and lead to inconsistent predictions in real-world scenarios. To address these issues, we propose GDiffTransNet, a Graph-Driven Diffusion-Transformer integrated Dynamic Emotion Network. The framework introduces a Hierarchical Graph Fusion Network (HGFN) to capture inter- and intra-modal relationships, enabling fine-grained emotional dependency modeling. A Gated Cross-Modal Transformer (GCMT) dynamically regulates information flow through gated cross-attention, allowing effective feature integration across modalities. To improve robustness under incomplete or corrupted inputs, a diffusion module reconstructs missing or noisy modalities in a context-aware manner. Additionally, a Dynamic Emotion Network (DEN) models temporal and contextual variations, enhancing the recognition of evolving emotions. Extensive experiments on benchmark datasets demonstrate the effectiveness of the proposed approach, which achieves 83.12%/82.52% on IEMOCAP, 72.94%/72.01% on MELD, and 74.8%/49.7% on CMU-MOSEI in terms of weighted accuracy (W-Acc) and weighted F1 (W-F1).
Arthanari et al. (Fri,) studied this question.
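
The W-Acc/W-F1 figures reported above can be illustrated with a small sketch. It assumes that W-Acc is per-class accuracy (recall) weighted by class support and that W-F1 is per-class F1 weighted by class support; the abstract does not spell out these definitions, so this is one common reading rather than the authors' evaluation code. The function name `weighted_metrics` and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Return (W-Acc, W-F1) under the assumed support-weighted definitions.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    support = Counter(y_true)  # number of true samples per emotion class
    total = len(y_true)
    w_acc = 0.0
    w_f1 = 0.0

    for label, n in support.items():
        # True positives and false positives for this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)

        recall = tp / n  # per-class accuracy
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0

        weight = n / total  # class support as a fraction of all samples
        w_acc += weight * recall
        w_f1 += weight * f1

    return w_acc, w_f1


# Toy example with three emotion classes (made-up labels, not real data).
y_true = ["happy", "happy", "sad", "sad", "sad", "angry"]
y_pred = ["happy", "sad", "sad", "sad", "angry", "angry"]
w_acc, w_f1 = weighted_metrics(y_true, y_pred)
print(f"W-Acc = {w_acc:.4f}, W-F1 = {w_f1:.4f}")
```

Under this reading, support-weighted per-class accuracy is algebraically equal to overall accuracy, so a gap between W-Acc and W-F1 of the size reported for CMU-MOSEI would reflect low per-class precision or recall on some classes rather than a difference in weighting.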