What question did this study set out to answer?

The aim is to enhance emotion recognition systems by integrating audio and visual features in a multimodal approach.

June 26, 2026 | Open Access

Multimodal emotion recognition using hybrid deep feature fusion under speaker-independent evaluation

Key Points

The aim is to enhance emotion recognition systems by integrating audio and visual features in a multimodal approach.
Developed a hybrid fusion system combining handcrafted audio features and deep visual features.
Evaluated on RAVDESS and CREMA-D datasets using random-split and speaker-independent protocols.
Fused the two modalities using concatenation, cross-attention, gated fusion, and multiplicative fusion.
Achieved 95.83% accuracy on RAVDESS under random-split evaluation.
Attained 48.06% ± 9.76% accuracy on RAVDESS under speaker-independent Leave-One-Speaker-Out (LOSO) testing.
Achieved 73.54% accuracy on CREMA-D with random splits and 53.12% ± 2.65% under subject-exclusive, speaker-independent 5-fold cross-validation.
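
The gap between the random-split and speaker-independent figures above comes from how samples are partitioned. The following is a minimal sketch of the two speaker-independent protocols, not the authors' code. The function names (`leave_one_speaker_out`, `speaker_exclusive_kfold`, `summarize`) are hypothetical, the round-robin assignment of speakers to folds is an assumption, and the source does not say whether the reported ± values are sample or population standard deviations (the sample version is assumed here).

```python
from collections import defaultdict
from statistics import mean, stdev


def leave_one_speaker_out(speaker_ids):
    """Yield (held_out_speaker, train_idx, test_idx), one fold per speaker."""
    by_speaker = defaultdict(list)
    for idx, spk in enumerate(speaker_ids):
        by_speaker[spk].append(idx)
    for held_out in sorted(by_speaker):
        test_idx = by_speaker[held_out]
        train_idx = sorted(
            i for spk, idxs in by_speaker.items() if spk != held_out for i in idxs
        )
        yield held_out, train_idx, test_idx


def speaker_exclusive_kfold(speaker_ids, k):
    """Yield (train_idx, test_idx) for k folds in which no speaker
    appears on both sides. Speakers are assigned to folds round-robin
    (an assumption); k should not exceed the number of speakers."""
    speakers = sorted(set(speaker_ids))
    fold_of = {spk: i % k for i, spk in enumerate(speakers)}
    for fold in range(k):
        test_idx = [i for i, s in enumerate(speaker_ids) if fold_of[s] == fold]
        train_idx = [i for i, s in enumerate(speaker_ids) if fold_of[s] != fold]
        yield train_idx, test_idx


def summarize(fold_accuracies):
    """Return (mean, sample standard deviation) across folds."""
    return mean(fold_accuracies), stdev(fold_accuracies)


# Toy example: six clips from three speakers.
toy_ids = ["s1", "s1", "s2", "s3", "s3", "s2"]
for held_out, train_idx, test_idx in leave_one_speaker_out(toy_ids):
    print(held_out, train_idx, test_idx)
```

In a random split, clips from the same speaker can land in both the training and test sets, so a model can partly rely on speaker identity. Both generators above rule that out by construction, which is one plausible reason the speaker-independent accuracies are lower and vary more across folds.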

Abstract

Emotion recognition remains one of the most important and complex challenges for machines, as most robots and AI agents struggle with human-centric perception and interpretation. This paper therefore introduces a novel multimodal emotion recognition system that analyzes emotions through two complementary channels: voice and facial expressions. The proposed approach is evaluated on the RAVDESS and CREMA-D datasets, which consist of acted emotional expressions across multiple discrete emotion categories. Using a multimodal deep feature fusion technique, the system combines handcrafted audio features (e.g., Mel-Frequency Cepstral Coefficients (MFCCs)) with deep visual features extracted from an attention-based VGGFace model. These features are integrated into a unified representation through a hybrid fusion strategy that jointly employs concatenation, cross-attention, gated fusion, and multiplicative fusion mechanisms to capture complementary cross-modal interactions. To ensure a comprehensive and realistic assessment, the model is evaluated under both random-split and strict speaker-independent protocols. On the RAVDESS dataset, the proposed system achieves 95.83% accuracy under random-split evaluation and 48.06% ± 9.76% accuracy under speaker-independent Leave-One-Speaker-Out (LOSO) testing. On the CREMA-D dataset, it attains 73.54% accuracy using random splits and 53.12% ± 2.65% accuracy under subject-exclusive, speaker-independent 5-fold cross-validation.
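
The four fusion mechanisms named in the abstract can be illustrated with a small, parameter-free sketch. This is not the authors' architecture: the abstract does not specify how the four outputs are combined, so concatenating them is an assumption, as are the identity projections in the attention step, the parameter-free gate, and the requirement that the audio vector and the per-frame visual vectors already share the same dimension. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def cross_attention(query, frames):
    """Audio vector (query) attends over per-frame visual vectors.
    Scaled dot-product attention with identity projections (assumption)."""
    d = len(query)
    scores = [
        sum(q * f for q, f in zip(query, frame)) / math.sqrt(d) for frame in frames
    ]
    weights = softmax(scores)
    return [sum(w * frame[i] for w, frame in zip(weights, frames)) for i in range(d)]


def gated_fusion(audio, visual):
    """Element-wise gate g mixes the modalities: g*a + (1-g)*v.
    A trained model would compute g with learned weights; sigmoid(a + v)
    is a stand-in used here for illustration."""
    gates = [sigmoid(a + v) for a, v in zip(audio, visual)]
    return [g * a + (1.0 - g) * v for g, a, v in zip(gates, audio, visual)]


def multiplicative_fusion(audio, visual):
    """Element-wise (Hadamard) product of the two feature vectors."""
    return [a * v for a, v in zip(audio, visual)]


def hybrid_fusion(audio, frames):
    """Concatenate the outputs of all four mechanisms (assumed combination)."""
    d = len(audio)
    # Mean-pool the frames into one visual vector for the non-attention paths.
    visual = [sum(frame[i] for frame in frames) / len(frames) for i in range(d)]
    return (
        audio + visual                         # plain concatenation
        + cross_attention(audio, frames)       # audio attends to visual frames
        + gated_fusion(audio, visual)          # gated mix
        + multiplicative_fusion(audio, visual) # element-wise product
    )


# Toy inputs: a 3-dimensional audio vector and two visual frame vectors.
audio = [0.2, -0.1, 0.5]
frames = [[0.1, 0.0, 0.3], [0.4, -0.2, 0.1]]
fused = hybrid_fusion(audio, frames)
print(len(fused))
```

With `d`-dimensional inputs the fused representation has `5 * d` entries: `2 * d` from concatenation and `d` from each of the other three mechanisms. A classifier head would then operate on this unified vector.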
