March 18, 2024 | Open Access

Multi-Modal Emotion Recognition Using Multiple Acoustic Features and Dual Cross-Modal Transformer

Abstract

Multi-modal emotion recognition (MER) using speech and text has attracted extensive attention because data for these two modalities are readily available. Recently, self-supervised learning (SSL) pre-trained models have become the state-of-the-art (SOTA) approach for extracting acoustic and textual features. However, the SSL speech representation may lose some important paralinguistic information, which limits the speech knowledge available for MER. In this paper, we propose to adopt two kinds of acoustic features (i.e., the SSL representation and the spectral feature) as inputs to comprehensively extract speech characteristics. In addition, a dual cross-modal Transformer module is presented to model the interaction on unaligned sequences between the textual feature and the two acoustic features. Moreover, we introduce a blended loss that includes two uni-modal losses to better extract the uni-modal information. Experiments conducted on the widely used IEMOCAP dataset indicate that our proposed method achieves SOTA performance compared with previous methods.
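
The two mechanisms named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the paper's implementation: the abstract gives no architectural details, so the single-head attention without learned projections, the choice of text as the query, the function names (`cross_attention`, `blended_loss`), the loss weights `w_speech` and `w_text`, and all toy values below are assumptions made for illustration. The sketch shows how cross-modal attention lets a text sequence attend to acoustic sequences of different lengths without alignment, and how a multi-modal loss might be combined with two uni-modal losses.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention (assumed simplification:
    no learned projections). Each query vector attends over every vector
    in keys_values, so the two sequences may differ in length."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys_values
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(d)]
        )
    return out


def cross_entropy(logits, label):
    """Cross-entropy of one example given raw class logits."""
    return -math.log(softmax(logits)[label])


def blended_loss(multi_logits, speech_logits, text_logits, label,
                 w_speech=1.0, w_text=1.0):
    """Multi-modal loss plus two weighted uni-modal losses.
    The weights are hypothetical; the abstract does not specify them."""
    return (
        cross_entropy(multi_logits, label)
        + w_speech * cross_entropy(speech_logits, label)
        + w_text * cross_entropy(text_logits, label)
    )


# Toy features: 3 text tokens, 5 SSL frames, 4 spectral frames (all dim 2).
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ssl_feat = [[0.5, 0.1], [0.2, 0.9], [0.7, 0.7], [0.0, 0.3], [1.0, 0.2]]
spec_feat = [[0.3, 0.3], [0.9, 0.1], [0.1, 0.8], [0.6, 0.4]]

# "Dual" here means the text attends to each acoustic feature separately.
text_from_ssl = cross_attention(text, ssl_feat)
text_from_spec = cross_attention(text, spec_feat)

# Uniform logits over 4 emotion classes for each of the three heads.
loss = blended_loss([0.0] * 4, [0.0] * 4, [0.0] * 4, label=2)
```

In this sketch each attention output has one vector per text token regardless of how many acoustic frames there are, which is the sense in which the sequences need not be aligned.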
