Speech emotion recognition (SER) systems struggle with Arabic dialects because of the language's linguistic complexity, its prosodic variation, and the scarcity of annotated data. Existing methods rely mainly on single-view representations, which fail to capture the multifaceted nature of emotional expression in Arabic speech. In this paper, we introduce SERMixAR, a contrastive multi-view fusion architecture designed specifically for Arabic speech emotion recognition. The proposed approach combines acoustic, linguistic, and prosodic features in a hierarchical fusion architecture enhanced with contrastive learning. View-specific encoders extract complementary emotion representations, and a fusion module then weights each view's contribution dynamically through adaptive attention conditioned on contextual relevance. The contrastive learning component improves inter-class separability while keeping the embedding space compact within each class. Extensive experiments on several Arabic dialect datasets show that SERMixAR outperforms the strongest baseline by 5.5% in accuracy and 5.7% in F1-score across multiple evaluation benchmarks. The framework is robust to a wide range of dialectal variation and to noisy acoustic environments, and it sets a new standard for Arabic speech emotion recognition.
Bouchelligua et al. studied this question.
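The two mechanisms named in the abstract, attention-weighted fusion of view embeddings and a contrastive objective over the fused embeddings, can be illustrated with a minimal NumPy sketch. The abstract does not give the actual formulation, so everything here is an assumption: the function names `attention_fuse` and `supervised_contrastive_loss`, the single scoring vector `w`, the temperature value, and the choice of a generic supervised contrastive loss are hypothetical stand-ins, not the SERMixAR implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_fuse(views, w):
    """Fuse per-view embeddings with per-utterance attention weights.

    views: (batch, n_views, dim), e.g. acoustic, linguistic, prosodic.
    w:     (dim,) hypothetical scoring vector (assumed, not from the paper).
    Returns the fused embeddings (batch, dim) and the weights (batch, n_views).
    """
    scores = views @ w                                 # (batch, n_views)
    weights = softmax(scores, axis=1)                  # sums to 1 over views
    fused = (weights[..., None] * views).sum(axis=1)   # weighted sum of views
    return fused, weights


def supervised_contrastive_loss(z, labels, temperature=0.1):
    """Generic supervised contrastive loss (an assumed stand-in).

    Pulls same-label embeddings together and pushes different-label
    embeddings apart, using cosine similarity scaled by a temperature.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalise rows
    n = len(labels)
    self_mask = np.eye(n, dtype=bool)

    # Pairwise similarities; exclude each sample's similarity to itself.
    sim = np.where(self_mask, -np.inf, z @ z.T / temperature)

    # Row-wise log-softmax over all other samples.
    m = sim.max(axis=1, keepdims=True)
    log_prob = sim - (m + np.log(np.exp(sim - m).sum(axis=1, keepdims=True)))

    # Positives: other samples that share the anchor's label.
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = pos.sum(axis=1)
    valid = pos_count > 0                              # anchors with a positive
    if not valid.any():
        return 0.0

    sum_pos = np.where(pos, log_prob, 0.0).sum(axis=1)
    return float(-(sum_pos[valid] / pos_count[valid]).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    views = rng.normal(size=(4, 3, 8))   # 4 utterances, 3 views, 8 dims
    w = rng.normal(size=8)
    fused, weights = attention_fuse(views, w)
    labels = np.array([0, 0, 1, 1])      # toy emotion labels
    print(fused.shape, weights.sum(axis=1))
    print(supervised_contrastive_loss(fused, labels))
```

In a trained system the scoring vector would be learned jointly with the encoders, and the contrastive term would typically be added to a classification loss; the sketch only shows how the two pieces fit together on fixed inputs.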