March 3, 2026 | Open Access

Cross-corpus language-independent speech emotion recognition using hybrid deep learning framework

Key Points

Models achieve up to 69% accuracy in cross-lingual emotion recognition, exceeding the baseline’s best score of 62.5%.
The MLP+LSTM trained on German data achieved 65% accuracy on Urdu, 12.5% higher than the baseline.
The deep and hybrid learning frameworks employed are ANN, MLP+LSTM, RF+DNN, and a custom Transformer.
Models show a better balance between precision and recall than classical methods.

Abstract

Speech Emotion Recognition (SER) plays a critical role in human-computer interaction, enabling systems to understand and respond to users more naturally. However, developing robust SER systems that generalize across languages remains a significant challenge because of linguistic and cultural variation. This study uses several deep and hybrid learning models to assess cross-lingual SER comprehensively and to address emotion recognition in speech across diverse cultures. Four architectures, Artificial Neural Network (ANN), Multi-Layer Perceptron + Long Short-Term Memory (MLP+LSTM), Random Forest + Deep Neural Network (RF+DNN), and our custom Transformer, are trained and tested on various combinations of Urdu, English, German, and Italian speech corpora. Compared with standard machine learning classifiers, our models handle multiple languages better. In our experiments, the models outperform the baseline methods in cross-lingual settings. In particular, the model trained on Italian data and tested on Urdu achieved the top accuracy of 69%, surpassing the baseline’s highest score of 62.5%. Similarly, the MLP+LSTM trained on German achieved 65% accuracy on Urdu, 12.5% higher than the baseline. Our best-trained model also gives a better balance of precision and recall than the baseline, with 58.9% versus 60.2%. On the difficult Urdu-to-English task, our Transformer model scored 0.4381 on F1, close to the baseline’s highest score of 0.44. These results demonstrate that deep and hybrid models can identify emotional cues that are consistent across languages, and they indicate that deep learning techniques are effective for SER, whereas classical methods tend to perform poorly across languages. Ongoing work explores domain adaptation, multilingual pretraining, and multimodal emotion recognition to improve cross-lingual performance.
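The cross-corpus setup and the metrics quoted in the abstract can be illustrated with a minimal sketch. It is not the study's pipeline: the function names `cross_corpus_pairs` and `macro_scores`, the use of every ordered train/test language pair, the macro averaging, and the toy labels are all assumptions made for illustration, since the abstract does not state how the combinations or the averages were formed.

```python
from itertools import permutations

# Corpora named in the abstract; the pairing scheme below is an assumption.
LANGUAGES = ["Urdu", "English", "German", "Italian"]


def cross_corpus_pairs(languages):
    """Return every ordered (train, test) pair of distinct languages."""
    return list(permutations(languages, 2))


def macro_scores(y_true, y_pred):
    """Compute accuracy and macro-averaged precision, recall, and F1.

    Macro averaging (unweighted mean over classes) is an assumption;
    the abstract does not say which averaging the study used.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []

    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


if __name__ == "__main__":
    # Hypothetical predictions for one train/test pair, not data from the study.
    y_true = ["angry", "happy", "sad", "happy"]
    y_pred = ["angry", "sad", "sad", "happy"]
    for train_lang, test_lang in cross_corpus_pairs(LANGUAGES):
        print(f"train on {train_lang}, test on {test_lang}")
    print(macro_scores(y_true, y_pred))
```

In a real experiment, each (train, test) pair would produce its own predictions from a trained model, and `macro_scores` would be applied to each pair separately so that results such as German-to-Urdu and Urdu-to-English can be compared against the baseline.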
