What question did this study set out to answer?

This research aims to enhance emotion recognition by integrating unaligned multimodal data from text, audio, and video.

February 27, 2026Open Access

A Deep Reinforcement Learning Approach for Emotion Recognition from Unaligned Multimodal Inputs

Key Points

This research aims to enhance emotion recognition by integrating unaligned multimodal data from text, audio, and video.
Developed a unified framework for processing unaligned data.
Implemented a pseudo-alignment algorithm to handle input discrepancies.
Used a multimodal data interaction process for fusing data types while preserving context.
Applied a deep reinforcement learning model for detecting inter-emotional dependencies.
Achieved high emotion recognition accuracy on the IEMOCAP dataset.
Demonstrated effectiveness in scenarios without pre-aligned multimodal inputs.

Abstract

• Introduces a novel approach for multimodal emotion recognition from unaligned inputs. • Integrates text, audio, and video information for improved emotion recognition. • Captures long-term dependencies across multiple modalities. • Explores correlations among emotional states for improved classification. • Demonstrates effectiveness on the IEMOCAP dataset without pre-aligned data Multimodal emotion recognition is essential in affective computing, as it enables a more accurate and comprehensive understanding of human emotions by integrating diverse data modalities. However, current approaches still face key challenges, including the difficulty of handling unaligned multimodal inputs, limited ability to model long-term dependencies, and insufficient attention to relationships among emotional labels. To address these issues, this paper introduces a unified framework that combines a Pseudo-Alignment Algorithm (PAA) for processing unaligned data, a Multimodal Data Interaction Process (MDIP) for fusing text, audio, and video while preserving long-term contextual information, and a Deep Reinforcement Learning-based Emotion Detection (DRLED) model for exploring inter-emotional dependencies. Experiments conducted on the IEMOCAP benchmark dataset demonstrate that the proposed approach achieves strong emotion recognition performance without relying on pre-aligned multimodal data, highlighting its effectiveness and robustness in real-world scenarios.

A Deep Reinforcement Learning Approach for Emotion Recognition from Unaligned Multimodal Inputs

Key Points

Abstract

Cite This Study