What question did this study set out to answer?

The aim is to improve multimodal emotion recognition by effectively capturing emotional cues and their relationships in conversational contexts.

May 9, 2026 | Open Access

A novel graph-driven diffusion-transformer based dynamic emotion network for multimodal emotion recognition

Key Points

Proposed a Graph-Driven Diffusion-Transformer integrated Dynamic Emotion Network (GDiffTransNet).
Utilized a Hierarchical Graph Fusion Network (HGFN) for inter- and intra-modal relationship modeling.
Employed a Gated Cross-Modal Transformer (GCMT) for enhanced feature integration across modalities.
Achieved 83.12%/82.52% W-Acc/W-F1 on the IEMOCAP dataset.
Reached 72.94%/72.01% W-Acc/W-F1 on the MELD dataset.
Obtained 74.8%/49.7% W-Acc/W-F1 on the CMU-MOSEI dataset.
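
For readers unfamiliar with the W-Acc/W-F1 notation, the sketch below shows one common way such scores are computed. It is an illustration, not the paper's evaluation code: the function names and the toy labels are hypothetical, and it assumes W-F1 means the support-weighted mean of per-class F1 and W-Acc means the support-weighted mean of per-class recall. The paper may define either metric differently.

```python
from collections import Counter


def _true_positives(y_true, y_pred, label):
    # Count utterances of class `label` that were predicted correctly.
    return sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 (assumed meaning of W-F1)."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = _true_positives(y_true, y_pred, label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


def weighted_accuracy(y_true, y_pred):
    """Support-weighted mean of per-class recall (assumed meaning of W-Acc).

    Under this definition the value coincides with plain accuracy.
    """
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = _true_positives(y_true, y_pred, label)
        score += (n / total) * (tp / n)
    return score


# Toy labels, invented purely for illustration.
y_true = ["happy", "happy", "sad", "angry"]
y_pred = ["happy", "sad", "sad", "angry"]
w_acc = weighted_accuracy(y_true, y_pred)
w_f1 = weighted_f1(y_true, y_pred)
print(f"W-Acc={w_acc:.4f}  W-F1={w_f1:.4f}")
```

Weighting by class support matters on imbalanced emotion datasets, where a frequent class such as "neutral" would otherwise be treated the same as a rare one.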

Abstract

Multimodal emotion recognition in conversational contexts has attracted increasing attention due to its ability to analyze human emotions by jointly modeling visual, textual, and audio cues in dynamic interactions. However, existing methods struggle to capture hierarchical relationships among emotional cues, are highly sensitive to noisy or missing data, and often fail to model fine-grained emotional transitions over time. These limitations hinder the interpretation of subtle emotional variations and lead to inconsistent predictions in real-world scenarios. To address these issues, we propose GDiffTransNet, a Graph-Driven Diffusion-Transformer integrated Dynamic Emotion Network. The framework introduces a Hierarchical Graph Fusion Network (HGFN) to capture inter- and intra-modal relationships, enabling fine-grained emotional dependency modeling. A Gated Cross-Modal Transformer (GCMT) is employed to dynamically regulate information flow through gated cross-attention, allowing effective feature integration across modalities. To improve robustness under incomplete or corrupted inputs, a diffusion module is designed to reconstruct missing or noisy modalities in a context-aware manner. Additionally, a Dynamic Emotion Network (DEN) models temporal and contextual variations, enhancing the recognition of evolving emotions. Extensive experiments on benchmark datasets demonstrate the effectiveness of the proposed approach, achieving 83.12%/82.52% on IEMOCAP, 72.94%/72.01% on MELD, and 74.8%/49.7% on CMU-MOSEI in terms of W-Acc/W-F1.
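
The gated cross-attention idea described in the abstract can be sketched in a few lines of NumPy. This is a minimal single-head illustration under stated assumptions, not the paper's GCMT: the function names, the parameter names (`W_q`, `W_k`, `W_v`, `W_g`, `b_g`), the sigmoid gate over the concatenated query and attended features, and the feature sizes are all hypothetical choices made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(d, seed=0):
    # Hypothetical randomly initialised weights; a real model would learn these.
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(d)
    return {
        "W_q": rng.normal(0.0, scale, (d, d)),
        "W_k": rng.normal(0.0, scale, (d, d)),
        "W_v": rng.normal(0.0, scale, (d, d)),
        "W_g": rng.normal(0.0, scale, (2 * d, d)),
        "b_g": np.zeros(d),
    }


def gated_cross_attention(query_mod, context_mod, params):
    """Fuse one modality (query) with another (context).

    query_mod:   (T_q, d) features, e.g. text utterances
    context_mod: (T_k, d) features, e.g. audio frames
    """
    q = query_mod @ params["W_q"]
    k = context_mod @ params["W_k"]
    v = context_mod @ params["W_v"]
    # Scaled dot-product attention from the query modality onto the context.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attended = softmax(scores, axis=-1) @ v  # (T_q, d)
    # The gate decides, per feature, how much cross-modal signal to admit.
    gate_in = np.concatenate([query_mod, attended], axis=-1)  # (T_q, 2d)
    gate = sigmoid(gate_in @ params["W_g"] + params["b_g"])  # (T_q, d)
    return gate * attended + (1.0 - gate) * query_mod


rng = np.random.default_rng(1)
text = rng.normal(size=(5, 8))   # 5 text steps, 8-dim features (toy sizes)
audio = rng.normal(size=(7, 8))  # 7 audio steps, 8-dim features (toy sizes)
params = init_params(d=8)
fused = gated_cross_attention(text, audio, params)
print(fused.shape)
```

With this form of gate, a value near zero leaves the query modality's own features almost unchanged, which is one plausible way to limit the influence of a noisy or unreliable context modality.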
