Multimodal sentiment analysis aims to infer sentiment polarity by jointly modeling textual and visual information. Despite recent advances in pretrained language and vision encoders, sentiment prediction from social media posts remains challenging because the textual and visual modalities are often weakly aligned, semantically noisy, and unevenly informative. Recent studies have emphasized the importance of fine-grained cross-modal fusion, stronger pretrained visual representations, and strategies for reducing modality bias in MVSA-style benchmarks. In this work, we present a systematic, implementation-driven study of multimodal sentiment classification on MVSA-Single. We first construct a clean, three-class, sentiment-consistent subset and then implement a broad set of baselines: text-only DistilBERT, image-only ResNet18, simple multimodal fusion, gated fusion, residual fusion, multi-task contrastive fusion, DINOv2-based fusion, and attention bottleneck fusion. Building on these experiments, we propose a semantic cross-modal fusion architecture that combines a RoBERTa text encoder with a CLIP vision encoder through cross-attention, allowing textual representations to selectively attend to sentiment-relevant visual signals. On the clean 2592-sample subset, the proposed model achieved the best overall performance, reaching 82.63% validation accuracy, 79.62% test accuracy, and a weighted F1 of 79.42, outperforming all other implemented baselines under the same experimental pipeline and dataset setting. To improve comparability with prior MVSA-Single studies, we additionally reconstructed a broader processed setting from the 4511-sample HDF5 version and aligned 4318 text–image pairs with their original image files. On this harder, protocol-matched setting, the same model achieved 72.69% test accuracy and a weighted F1 of 70.66, revealing a substantial performance gap that we attribute to dataset construction and residual multimodal noise.
These findings indicate that strong cross-modal semantic alignment contributes more to robust multimodal sentiment prediction than simply increasing architectural complexity, and that CLIP-based visual semantics are more beneficial than DINOv2 features in our text–image sentiment setting.
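The cross-attention fusion described above, in which text tokens act as queries over visual tokens, can be illustrated with a minimal, dependency-free sketch. This is not the paper's implementation: it assumes a single attention head, omits the learned query/key/value projections (treating them as identity), and uses hypothetical toy vectors in place of RoBERTa and CLIP features. The names `cross_attention`, `text_tokens`, and `image_tokens` are illustrative only.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over one row of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_tokens: Matrix, image_tokens: Matrix) -> Tuple[Matrix, Matrix]:
    """Single-head scaled dot-product cross-attention.

    Queries come from the text tokens; keys and values come from the image
    tokens. Learned projections are omitted (assumption made for brevity).
    Returns (visual_context_per_text_token, attention_weights).
    """
    dim = len(text_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    contexts: Matrix = []
    weights: Matrix = []
    for query in text_tokens:
        # Similarity between this text token and every image token.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in image_tokens]
        attn = softmax(scores)
        # Attention-weighted average of the image tokens.
        context = [sum(a * value[d] for a, value in zip(attn, image_tokens)) for d in range(dim)]
        contexts.append(context)
        weights.append(attn)
    return contexts, weights


# Hypothetical toy features: 2 text tokens, 3 image patches, dimension 4.
text_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
image_tokens = [[4.0, 0.0, 0.0, 0.0], [0.0, 4.0, 0.0, 0.0], [0.0, 0.0, 4.0, 0.0]]

fused, weights = cross_attention(text_tokens, image_tokens)
```

In this toy setup each text token places most of its attention on the image patch it is most similar to, which is the selective behaviour the fusion module relies on. A practical model would add learned projections, multiple heads, and a classification head on top of the fused representation.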
Natarajan et al. studied this question.