Multimodal sentiment analysis (MSA) has become a key research area in artificial intelligence, aiming to predict sentiment polarity or intensity by jointly modeling text, audio, and visual information. However, traditional methods still face significant challenges from the inherent heterogeneity among modalities, discrepancies in semantic representation, and insufficient cross-modal interaction. To address these issues, this paper proposes a multimodal sentiment classification model that integrates bidirectional cross-modal attention with multi-level constraint optimization. Specifically, a unified multimodal feature encoding (UMFE) module combining BiLSTM and transformer architectures is first constructed to extract unimodal representations from the text, audio, and visual modalities, improving their robustness and discriminative ability. On this basis, we introduce a bidirectional cross-modal attention mechanism that performs Query–Key attention between modalities, enabling each modality to selectively aggregate complementary information and capture cross-modal semantic dependencies. Furthermore, a cross-modal re-fusion transformer (HMRT) module treats the textual modality as dominant and uses it to guide the deep fusion of high-level semantic features after cross-modal interaction, producing a compact unified representation. Finally, a multi-task joint optimization framework with uncertainty-based adaptive weighting dynamically balances the unimodal supervision loss, the cross-modal consistency loss, and the sentiment classification loss, which helps improve representation learning and generalization ability.
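As a rough illustration of two mechanisms named above, the following is a minimal sketch and not the authors' implementation. It assumes that both modalities share one feature dimension, that the learned Query/Key/Value projections are omitted (identity), and that the uncertainty-based weighting takes the commonly used simplified form sum_i exp(-s_i) * L_i + s_i with a learnable log-variance s_i per task. The abstract does not specify any of these details, and all function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product attention: each query vector (one modality)
    aggregates the sequence of the other modality.
    Assumption: no learned projections, keys and values are the same vectors."""
    scale = math.sqrt(len(queries[0]))
    dim = len(keys_values[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys_values]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[d] for w, v in zip(weights, keys_values)) for d in range(dim)]
        )
    return attended


def bidirectional_cross_attention(x, y):
    """Run attention in both directions: x queries y, and y queries x."""
    return cross_attend(x, y), cross_attend(y, x)


def uncertainty_weighted_loss(losses, log_vars):
    """Assumed weighting form: sum_i exp(-s_i) * L_i + s_i,
    where s_i is a learnable log-variance for task i."""
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))


# Toy inputs: two text tokens and one audio frame, feature dimension 2.
text = [[1.0, 0.0], [0.0, 1.0]]
audio = [[1.0, 1.0]]
text_from_audio, audio_from_text = bidirectional_cross_attention(text, audio)

# Three task losses (unimodal, consistency, classification) with equal log-variances.
total = uncertainty_weighted_loss([1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
```

In a trained model the log-variances would be optimized jointly with the network, so a task with high estimated uncertainty is down-weighted automatically while the additive s_i term discourages the weights from collapsing to zero.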
Wu et al. (Wed,) studied this question.