March 3, 2026 | Open Access

SERMixAR: Contrastive Multi-View Fusion for Robust Arabic Speech Emotion Recognition

Key Points

SERMixAR improves speech emotion recognition accuracy by 5.5% and F1-score by 5.7% over the strongest baseline.
The approach integrates acoustic, linguistic, and prosodic features using a multi-view fusion technique.
Contrastive learning improves inter-class separability in emotion detection within Arabic dialects.
Adaptive attention dynamically weights each view's contribution, which supports the framework's effectiveness across dialects and noise levels.

Abstract

Speech emotion recognition (SER) systems struggle with Arabic dialects because of the language's linguistic complexity, its prosodic variation, and the scarcity of annotated data. Current methods rely mainly on single-view representations, which do not capture the complex nature of emotional expression in Arabic speech. This paper introduces SERMixAR, a new contrastive multi-view fusion architecture developed specifically for Arabic speech emotion recognition. The proposed method combines acoustic, linguistic, and prosodic features in a hierarchical fusion architecture enhanced with contrastive learning. View-specific encoders extract complementary emotion representations, and the fusion module then uses adaptive attention to weight each view's contribution dynamically according to its contextual relevance. The contrastive learning component improves inter-class separability while keeping the embedding space compact within each class. Extensive experiments on several Arabic dialect datasets show that SERMixAR outperforms the strongest baseline by 5.5% in accuracy and 5.7% in F1-score across multiple evaluation benchmarks. The framework is robust to a wide range of dialectal variation and noisy acoustic environments, and it sets a new standard for Arabic speech emotion recognition.
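The abstract does not give the exact formulation of the fusion module or the contrastive objective, so the following is only a minimal sketch of the two ideas under stated assumptions. It assumes each view (acoustic, linguistic, prosodic) has already been encoded into a vector of the same dimension, that the attention weights come from a scaled dot product against a hypothetical context vector `query`, and that the contrastive term follows the general supervised contrastive (SupCon-style) form. The function names, the temperature value, and the toy vectors are illustrative and are not taken from the paper.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def attention_fuse(views, query):
    """Fuse view embeddings with attention weights (assumed formulation).

    Each view gets a weight softmax(query . view / sqrt(dim)); the fused
    embedding is the weighted sum of the views.
    """
    dim = len(query)
    weights = softmax([dot(query, v) / math.sqrt(dim) for v in views])
    fused = [sum(w * v[i] for w, v in zip(weights, views)) for i in range(dim)]
    return fused, weights


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """SupCon-style loss (assumed form, not the paper's exact objective).

    For each anchor, same-label samples are positives and all other samples
    appear in the denominator. Anchors without positives are skipped.
    """
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        logits = {j: dot(z[i], z[j]) / temperature for j in range(n) if j != i}
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy usage: three hypothetical view embeddings and a context vector.
acoustic, linguistic, prosodic = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
fused, weights = attention_fuse([acoustic, linguistic, prosodic], [2.0, 0.0, 0.0])
```

In this toy setup the context vector aligns with the acoustic view, so that view receives the largest weight. The contrastive loss is lower when embeddings that share an emotion label lie close together and higher when the labels cut across the clusters, which is the separability and compactness behavior the abstract describes.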
