Real-time emotion recognition in conversations (ERC), which performs recognition using only historical utterances, has recently gained increasing attention because of its importance for real-time empathetic services. Although multimodal information can mitigate the limitations of unimodal approaches, few real-time ERC studies account for differences in the representational ability of the modalities or explore conversational context comprehensively from different perspectives using different structures. Furthermore, the heavy cost of annotation makes it difficult to collect sufficient labeled data, which limits the performance of current supervised ERC approaches. To address these issues, we propose SMFNM, a novel framework for real-time ERC that integrates semi-supervised learning with multimodal fusion guided by a main modality. Specifically, SMFNM uses additional unlabeled data to extract high-quality intra-modal representations and applies cross-modal interaction to capture complementary information that enhances the audio representations. SMFNM then employs a directed acyclic graph and Gated Recurrent Units to explore conversational context more accurately from the multimodal and main-modal perspectives, respectively. Finally, the two types of contextual features are fused for emotion identification. Extensive experiments on benchmark datasets (IEMOCAP (4-way), IEMOCAP (6-way), and MELD) demonstrate the effectiveness, superiority, and rationality of SMFNM.
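The real-time constraint described above (each utterance may draw only on earlier utterances) can be illustrated with a minimal sketch of a history-only directed acyclic graph. This is not the authors' graph construction: the function name `build_history_edges` and the `window` parameter are hypothetical, introduced here only to show why edges that always point from the past to the present keep the graph acyclic and exclude future context.

```python
def build_history_edges(num_utterances, window=2):
    """Return directed edges (past, current) for a conversation.

    Hypothetical helper for illustration: each utterance receives edges
    only from up to `window` immediately preceding utterances, so no
    future utterance can influence the current one.
    """
    edges = []
    for current in range(num_utterances):
        start = max(0, current - window)
        for past in range(start, current):
            edges.append((past, current))
    return edges


# Example: a four-utterance conversation with a history window of two.
edges = build_history_edges(4, window=2)
print(edges)
```

Because every edge satisfies `past < current`, the utterance order is itself a topological order, so a model that processes utterances as they arrive never needs information from later turns.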
Yang et al. (Sun,) studied this question.