Multi-modal emotion recognition (MER) using speech and text has attracted extensive attention because data for these two modalities are easy to obtain. Recently, self-supervised learning (SSL) pre-trained models have become the state-of-the-art (SOTA) approach for extracting acoustic and textual features. However, the SSL speech representation may lose some important paralinguistic information, which limits the speech knowledge available for MER. In this paper, we propose to adopt two kinds of acoustic features (i.e., the SSL representation and the spectral feature) as inputs to extract speech characteristics more comprehensively. In addition, a dual cross-modal Transformer module is presented to model the interaction, over unaligned sequences, between the textual feature and the two acoustic features. Moreover, we introduce a blended loss that includes two uni-modal losses to better extract the uni-modal information. Experiments conducted on the widely used IEMOCAP dataset indicate that our proposed method achieves SOTA performance compared with previous methods.
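The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plain scaled dot-product attention without learned projections, multiple heads, or layer stacking, and it assumes the blended loss is a weighted sum of a multi-modal cross-entropy and two uni-modal cross-entropies. The function names and the weights `alpha` and `beta` are hypothetical. The point is that text queries can attend over acoustic sequences of a different length, so no alignment between modalities is needed.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys, values):
    """Scaled dot-product attention from one modality onto another.

    queries: T_q vectors of dimension d (e.g. text steps).
    keys, values: T_k vectors (e.g. acoustic frames); T_k may differ from T_q.
    Returns T_q vectors, one per query step.
    """
    d = len(queries[0])
    dim_v = len(values[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)])
    return out


def cross_entropy(logits, label):
    """Cross-entropy of a single example given raw class logits."""
    return -math.log(softmax(logits)[label])


def blended_loss(multi_logits, speech_logits, text_logits, label, alpha=1.0, beta=1.0):
    """Multi-modal loss plus two weighted uni-modal losses (weights are assumed)."""
    return (cross_entropy(multi_logits, label)
            + alpha * cross_entropy(speech_logits, label)
            + beta * cross_entropy(text_logits, label))


# Toy inputs: 3 text steps, 5 SSL frames, 4 spectral frames, all of dimension 2.
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ssl_feats = [[0.5, 0.1], [0.2, 0.9], [0.7, 0.3], [0.1, 0.4], [0.6, 0.6]]
spec_feats = [[0.3, 0.8], [0.9, 0.2], [0.4, 0.4], [0.0, 1.0]]

# "Dual" here means the text attends to each acoustic stream separately.
text_from_ssl = cross_modal_attention(text, ssl_feats, ssl_feats)
text_from_spec = cross_modal_attention(text, spec_feats, spec_feats)
```

Both outputs have one vector per text step regardless of how many acoustic frames were attended over, which is what lets the sequences stay unaligned. How the paper fuses the two attended streams and weights the loss terms is not stated in the abstract, so the sketch leaves fusion out.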
Wu et al. studied this question.