Abstract
Multimodal sentiment analysis suffers from insufficient multimodal feature extraction and from limited semantic diversity and interaction across modalities. To address these challenges, this paper introduces Deep Temporal Features and Multi-Level Cross-Modal Attention Fusion (DTMCAF). First, a deep temporal feature extractor is developed: a multimodal temporal modeling network that combines bidirectional LSTMs with multi-head self-attention to capture features from each modality. Next, hierarchical cross-modal attention mechanisms and feature-enhancement attention modules are designed to enable thorough information exchange between modalities. Gated fusion and multi-layer feature transformations are then employed to strengthen the multimodal representations. Finally, a multi-component collaborative loss function is proposed to align cross-modal features and optimize sentiment representations. Comprehensive experiments on the CMU-MOSI and CMU-MOSEI datasets show that the proposed method outperforms current state-of-the-art techniques in correlation, accuracy, and F1 score, improving the precision of multimodal sentiment analysis.
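The abstract names two building blocks, cross-modal attention and gated fusion, without giving their form. The sketch below illustrates the generic versions of both in plain Python, as a rough aid to reading and not as the paper's implementation. The function names, the toy "text" and "audio" features, and the gate parameterization (a sigmoid over the concatenated inputs) are all assumptions made for illustration; DTMCAF's actual layers, dimensions, and loss are not specified in the abstract.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_attention(queries, keys, values):
    # Scaled dot-product attention where the queries come from one modality
    # and the keys/values come from another (hypothetical helper).
    d = len(keys[0])
    attended = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        attended.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return attended

def gated_fusion(a, b, gate_weights, gate_bias):
    # Assumed gate form: g_j = sigmoid(w_j . [a; b] + bias_j),
    # output_j = g_j * a_j + (1 - g_j) * b_j.
    concat = a + b
    fused = []
    for j in range(len(a)):
        z = sum(w * x for w, x in zip(gate_weights[j], concat)) + gate_bias[j]
        g = 1.0 / (1.0 + math.exp(-z))
        fused.append(g * a[j] + (1.0 - g) * b[j])
    return fused

# Toy features: 2 text time steps and 3 audio time steps, each of dimension 2.
text = [[1.0, 0.0], [0.0, 1.0]]
audio = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]

# Text attends to audio: one audio-informed vector per text time step.
attended = cross_modal_attention(text, audio, audio)

# With all-zero gate parameters every gate is 0.5, so fusion is a plain average.
zero_w = [[0.0] * 4 for _ in range(2)]
fused = [gated_fusion(t, a, zero_w, [0.0, 0.0]) for t, a in zip(text, attended)]
```

Each attended vector is a convex combination of the audio vectors, so the output keeps the query modality's sequence length while carrying the other modality's content. In a trained model the gate parameters would be learned, letting the network decide per dimension how much of each modality to keep.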
Min Zhu
Shanghai Jiao Tong University
Min Zhu (Thu,) studied this question.
synapsesocial.com/papers/68d44b3831b076d99fa54df3 — DOI: https://doi.org/10.21203/rs.3.rs-7521327/v1