• Introduces a novel approach for multimodal emotion recognition from unaligned inputs.
• Integrates text, audio, and video information for improved emotion recognition.
• Captures long-term dependencies across multiple modalities.
• Explores correlations among emotional states for improved classification.
• Demonstrates effectiveness on the IEMOCAP dataset without pre-aligned data.

Multimodal emotion recognition is essential in affective computing, as it enables a more accurate and comprehensive understanding of human emotions by integrating diverse data modalities. However, current approaches still face key challenges, including the difficulty of handling unaligned multimodal inputs, limited ability to model long-term dependencies, and insufficient attention to relationships among emotional labels. To address these issues, this paper introduces a unified framework that combines a Pseudo-Alignment Algorithm (PAA) for processing unaligned data, a Multimodal Data Interaction Process (MDIP) for fusing text, audio, and video while preserving long-term contextual information, and a Deep Reinforcement Learning-based Emotion Detection (DRLED) model for exploring inter-emotional dependencies. Experiments conducted on the IEMOCAP benchmark dataset demonstrate that the proposed approach achieves strong emotion recognition performance without relying on pre-aligned multimodal data, highlighting its effectiveness and robustness in real-world scenarios.
Hamdaoui et al. (Sun,) studied this question.