Speech emotion recognition (SER) is challenging owing to the complexity of emotional representation. This article therefore focuses on multimodal speech emotion recognition, which analyzes the speaker’s sentiment state from audio signals and textual content. Existing multimodal approaches use sequential networks to capture the temporal dependencies in the various feature sequences, ignoring the underlying relations within the acoustic and textual modalities. Moreover, current feature-level and decision-level fusion methods have unresolved limitations. This paper therefore develops a novel multimodal fusion graph convolutional network that comprehensively models information interactions within and between the two modalities. Specifically, we construct intra-modal relations to extract the intrinsic characteristics exclusive to each modality. For inter-modal fusion, we devise a multi-perspective fusion mechanism that integrates the complementary information between the two modalities. Extensive experiments on the IEMOCAP and RAVDESS datasets demonstrate that our approach achieves superior performance.
• Develop a multimodal fusion graph convolutional network to execute the intra- and inter-modal interactions.
• Extract the sentiment, semantic, and temporal dependencies to construct the intra-modal relations.
• Devise a multi-perspective fusion mechanism for inter-modal fusion.
• Adopt a multi-angle loss to optimize the model.
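To make the idea of intra- and inter-modal interactions on a graph more concrete, the following is a minimal sketch of a generic graph convolution step over a toy two-modality graph. It is not the authors' model: the abstract does not specify how edges are built or how fusion is performed, so the helper names `build_adjacency` and `gcn_layer`, the rule of linking temporally adjacent nodes within a modality, the one-to-one alignment of acoustic and textual nodes, and the feature sizes are all assumptions chosen for illustration. The propagation rule is the standard symmetric-normalized graph convolution, used here as a stand-in.

```python
import numpy as np

def build_adjacency(n_audio, n_text):
    """Toy graph: nodes 0..n_audio-1 are acoustic, the rest are textual (assumed layout)."""
    n = n_audio + n_text
    adj = np.zeros((n, n))
    # Intra-modal edges (assumption): link temporally adjacent nodes within each modality.
    for start, count in ((0, n_audio), (n_audio, n_text)):
        for i in range(start, start + count - 1):
            adj[i, i + 1] = adj[i + 1, i] = 1.0
    # Inter-modal edges (assumption): link the i-th acoustic node to the i-th textual node.
    for i in range(min(n_audio, n_text)):
        adj[i, n_audio + i] = adj[n_audio + i, i] = 1.0
    return adj

def gcn_layer(adj, features, weight):
    """One generic graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    a_hat = adj + np.eye(adj.shape[0])          # add self-loops
    deg = a_hat.sum(axis=1)                     # node degrees, always >= 1
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ features @ weight, 0.0)

rng = np.random.default_rng(0)
adj = build_adjacency(n_audio=3, n_text=3)
features = rng.normal(size=(6, 4))   # hypothetical 4-dim node features
weight = rng.normal(size=(4, 2))     # hypothetical projection to 2 dims
hidden = gcn_layer(adj, features, weight)
```

After one such layer, each node's representation mixes information from its temporal neighbours in the same modality and from its aligned node in the other modality, which is the general sense in which a graph can carry both kinds of interaction.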
Qi et al. studied this question.
synapsesocial.com/papers/69ff4aab4716aad0cc85479e — DOI: https://doi.org/10.1016/j.neucom.2024.128646
Xin Qi
University of Technology Sydney
Yujun Wen
Communication University of China
Pengzhou Zhang
Communication University of China
Neurocomputing
Beijing Institute of Technology
Communication University of China