In this paper, we present our findings on how representation learning on large unlabeled speech corpora can benefit speech emotion recognition (SER). Prior work on representation learning for SER has mostly focused on relatively small emotional speech datasets, without making use of additional unlabeled speech data. We show that integrating representations learned by an unsupervised autoencoder into a CNN-based emotion classifier improves recognition accuracy. To gain insight into what these models learn, we analyze visualizations of the different representations using t-distributed stochastic neighbor embedding (t-SNE). We evaluate our approach on IEMOCAP and MSP-IMPROV through within-corpus and cross-corpus testing.
Neumann et al. studied this question.
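
The abstract does not specify how the autoencoder representation is integrated into the CNN-based classifier. As a minimal sketch, assuming the simplest possible fusion (concatenating the autoencoder's bottleneck vector with the CNN's pooled feature vector before a softmax output layer), the step could look like the following. All names, dimensions, the four-class setup, and the random weights are hypothetical placeholders for illustration, not the authors' implementation.

```python
import math
import random


def softmax(scores):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(x, weights, bias):
    """Fully connected layer: one weight row and one bias term per output class."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def classify_with_fusion(cnn_features, ae_features, weights, bias):
    """Assumed fusion scheme: concatenate both feature vectors, then classify."""
    fused = list(cnn_features) + list(ae_features)
    return softmax(linear(fused, weights, bias))


# Hypothetical sizes: 8 CNN features, 4 autoencoder features, 4 emotion classes.
rng = random.Random(0)
cnn_features = [rng.uniform(-1, 1) for _ in range(8)]
ae_features = [rng.uniform(-1, 1) for _ in range(4)]
num_classes = 4
fused_dim = len(cnn_features) + len(ae_features)

# Random weights stand in for parameters that would normally be trained.
weights = [[rng.uniform(-0.1, 0.1) for _ in range(fused_dim)]
           for _ in range(num_classes)]
bias = [0.0] * num_classes

probs = classify_with_fusion(cnn_features, ae_features, weights, bias)
print(probs)
```

In such a setup, the autoencoder would be trained separately on the unlabeled speech data, and only its encoder output would be passed to the classifier. Whether the encoder stays frozen or is fine-tuned jointly is a design choice the abstract leaves open.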