Multimodal sentiment analysis (MSA) is essential for human-computer interaction. It combines textual, visual, and acoustic signals to recognize emotion more accurately. Despite recent advances, current methods still struggle to fuse fine-grained features and do not adequately model complex cross-modal relationships. To address these challenges, we propose a novel Multimodal-Aware Contrastive Learning (MACL) framework. First, MACL introduces a Dynamic Multi-Scale Attention (DMSA) mechanism that adaptively captures multi-level temporal and spatial features within each modality, thereby enhancing the fidelity of intra-modal feature representations and improving sensitivity to subtle emotional cues. Second, MACL incorporates a Modality-Aware Representation Learning (MARL) module that jointly learns modality-shared and modality-specific representations, enabling the model to preserve fine-grained local details while aligning global semantic information across heterogeneous modalities. Third, a contrastive learning strategy based on Information Noise-Contrastive Estimation (InfoNCE) is incorporated to maintain semantic consistency. Experimental results on the benchmark CMU Multimodal Opinion-Level Sentiment Intensity (CMU-MOSI) and CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) datasets demonstrate that MACL consistently outperforms existing state-of-the-art approaches, validating its robustness and superior generalization.
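The InfoNCE objective named above can be illustrated with a minimal sketch. This is the generic InfoNCE loss over cosine similarities, not MACL's actual implementation: the abstract does not say how positive and negative pairs are formed, so the pairing used here (the i-th text embedding matches the i-th audio embedding, and every other audio embedding in the batch is a negative), the temperature value, the function names, and the toy embeddings are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    Assumption: positives[i] is the matching sample for anchors[i];
    every other positives[j] in the batch is treated as a negative.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with the max subtracted for numerical stability.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        # -log( exp(logit_i) / sum_j exp(logit_j) )
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Hypothetical toy embeddings for two utterances in two modalities.
text = [[1.0, 0.0], [0.0, 1.0]]
audio_aligned = [[0.9, 0.1], [0.1, 0.9]]   # each row matches its text row
audio_shuffled = [[0.1, 0.9], [0.9, 0.1]]  # rows swapped, so pairs mismatch

loss_aligned = info_nce(text, audio_aligned)
loss_shuffled = info_nce(text, audio_shuffled)
```

In this toy setup the loss is lower when matching text and audio embeddings point in similar directions than when the pairs are swapped, which is the sense in which such an objective encourages semantically consistent representations.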
Lusha Zhu studied this question.