Self-supervised learning (SSL) from unlabelled speech data has revolutionized speech representation learning. Among SSL models, WavLM, wav2vec 2.0, HuBERT, and data2vec have achieved benchmark performance on automatic speech recognition. However, few studies have explored how well SSL-based representations generalize to tasks that rely on paralinguistic information in speech, such as emotion recognition. This paper explores the generalization of these four popular SSL models for speech emotion recognition (SER) when they are trained and tested in different domains. We aim to understand how adaptable these SSL representations are when using simple domain adaptation techniques. The evaluation considers emotional speech databases that differ in language, recording conditions, and emotional distribution, providing very different target domains. The results reveal the need to fine-tune the representations for the downstream SER task. As the differences between the source and target domains increase, we observe that unsupervised domain adaptation techniques become more effective. The analysis in this study provides useful insights into the advantages of different representations for domain adaptation in SER.
Naini et al. studied this question.