March 18, 2024 · Open Access

Generalization of Self-Supervised Learning-Based Representations for Cross-Domain Speech Emotion Recognition

Abstract

Self-supervised learning (SSL) from unlabelled speech data has revolutionized speech representation learning. Among SSL models, WavLM, wav2vec 2.0, HuBERT, and data2vec have produced benchmark performances on automatic speech recognition. However, few studies have explored how well SSL-based representations generalize to tasks that rely on paralinguistic information in speech, such as emotion recognition. This paper explores the generalization of these four popular SSL models for speech emotion recognition (SER) when they are trained and tested in different domains. We aim to understand how adaptable these SSL representations are when using simple domain adaptation techniques. The evaluation considers emotional speech databases that differ in language, recording conditions, and emotional distribution, providing very different target domains. The results reveal the need to fine-tune the representations for the downstream SER task. As the differences between the source and target domains increase, we observe that unsupervised domain adaptation techniques become more effective. The analysis in this study provides useful insights into the advantages of different representations for domain adaptation in SER.
