Abstract

Emotion recognition remains one of the most important and difficult problems in machine perception, as most robots and AI agents still struggle to perceive and interpret human-centric signals. This paper introduces a novel multimodal emotion recognition system that analyzes emotions through two complementary channels: voice and facial expressions. The approach is evaluated on the RAVDESS and CREMA-D datasets, which consist of acted emotional expressions across multiple discrete emotion categories. The system combines handcrafted audio features (e.g., Mel-Frequency Cepstral Coefficients (MFCCs)) with deep visual features extracted from an attention-based VGGFace model. These features are integrated into a unified representation through a hybrid fusion strategy that jointly employs concatenation, cross-attention, gated fusion, and multiplicative fusion to capture complementary cross-modal interactions. To provide a comprehensive and realistic assessment, the model is evaluated under both random-split and strict speaker-independent protocols. On RAVDESS, the system achieves 95.83% accuracy under random-split evaluation and 48.06% ± 9.76% under speaker-independent Leave-One-Speaker-Out (LOSO) testing. On CREMA-D, it attains 73.54% accuracy with random splits and 53.12% ± 2.65% under subject-exclusive, speaker-independent 5-fold cross-validation.
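To make the hybrid fusion idea concrete, the following is a minimal sketch of three of the four mechanisms named above (concatenation, gated fusion, and multiplicative fusion) applied to toy vectors. It is an illustration under stated assumptions, not the paper's implementation: the function names, the 4-dimensional embeddings, the random gate parameters, and the choice to stack the three views into one vector are all hypothetical, and cross-attention is omitted because it requires sequence-level features.

```python
import math
import random


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(audio, visual, gate_weights, gate_bias):
    """Blend two equal-length vectors with a per-dimension gate in (0, 1)."""
    joint = audio + visual  # list concatenation, so the gate sees both modalities
    fused = []
    for i, (a, v) in enumerate(zip(audio, visual)):
        pre_activation = sum(w * x for w, x in zip(gate_weights[i], joint)) + gate_bias[i]
        g = sigmoid(pre_activation)
        # Convex combination: g near 1 favors audio, g near 0 favors visual.
        fused.append(g * a + (1.0 - g) * v)
    return fused


def multiplicative_fusion(audio, visual):
    """Element-wise product of the two modality vectors."""
    return [a * v for a, v in zip(audio, visual)]


def hybrid_fusion(audio, visual, gate_weights, gate_bias):
    """Stack the concatenated, gated, and multiplicative views into one vector.

    Assumption: stacking is one plausible way to combine the views; the
    source does not specify how the mechanisms are merged.
    """
    return (audio + visual
            + gated_fusion(audio, visual, gate_weights, gate_bias)
            + multiplicative_fusion(audio, visual))


# Toy inputs: hypothetical 4-dimensional embeddings, not real MFCC or VGGFace features.
rng = random.Random(0)
dim = 4
audio_vec = [rng.uniform(-1, 1) for _ in range(dim)]
visual_vec = [rng.uniform(-1, 1) for _ in range(dim)]
# Random values stand in for gate parameters that would normally be learned.
gate_weights = [[rng.uniform(-0.5, 0.5) for _ in range(2 * dim)] for _ in range(dim)]
gate_bias = [0.0] * dim

fused = hybrid_fusion(audio_vec, visual_vec, gate_weights, gate_bias)
print(len(fused))  # 2*dim (concatenation) + dim (gated) + dim (multiplicative)
```

In a trained model the gate parameters would be learned jointly with the classifier, and both modalities would first be projected to a common dimension so that the gated and multiplicative terms are well defined.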
Ibrahim et al. studied this question.
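The speaker-independent protocol mentioned in the abstract can be illustrated with a short sketch of Leave-One-Speaker-Out splitting, in which every fold holds out all clips from one speaker so that no voice or face appears in both training and test data. The speaker IDs, labels, and majority-class baseline below are toy placeholders, not RAVDESS data or the paper's model, and the use of the sample standard deviation for the "±" term is an assumption, since the source does not say which estimator was used.

```python
from collections import Counter
from statistics import mean, stdev


def loso_splits(speaker_ids):
    """Yield (held_out_speaker, train_indices, test_indices), one fold per speaker."""
    for held_out in sorted(set(speaker_ids)):
        train_idx = [i for i, s in enumerate(speaker_ids) if s != held_out]
        test_idx = [i for i, s in enumerate(speaker_ids) if s == held_out]
        yield held_out, train_idx, test_idx


def majority_baseline_accuracy(labels, train_idx, test_idx):
    """Predict the most frequent training label for every test clip."""
    majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
    correct = sum(1 for i in test_idx if labels[i] == majority)
    return correct / len(test_idx)


# Toy data: one entry per clip (hypothetical IDs and labels).
speaker_ids = ["a01", "a01", "a02", "a03", "a03", "a03"]
labels = ["happy", "sad", "happy", "angry", "happy", "sad"]

folds = list(loso_splits(speaker_ids))
fold_accuracies = [
    majority_baseline_accuracy(labels, train_idx, test_idx)
    for _, train_idx, test_idx in folds
]

# Report mean and spread across held-out speakers (sample standard deviation).
print(f"{mean(fold_accuracies):.4f} +/- {stdev(fold_accuracies):.4f}")
```

Because each fold tests on a speaker the model has never seen, fold accuracies can vary widely from speaker to speaker, which is why speaker-independent results are reported with a spread rather than as a single number.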