What question did this study set out to answer?

The research aims to explore effective multimodal fusion techniques for improving sentiment analysis accuracy on social media posts.

May 7, 2026 | Open Access

Text-Anchored Residual Cross-Modal Fusion for Multimodal Sentiment Analysis: A Unified and Protocol-Aware Evaluation on MVSA-Single

Key Points

Implemented multiple baseline models including DistilBERT and ResNet18
Developed a semantic cross-modal fusion architecture with RoBERTa and CLIP
Evaluated models on both a clean subset and a broader reconstructed setting
Achieved 82.63% validation accuracy and 79.42 weighted F1 on the clean subset
Outperformed all implemented baselines under the same experimental pipeline and dataset setting
Dropped to 72.69% test accuracy and 70.66 weighted F1 on the harder protocol-matched setting, a gap attributed to dataset construction and residual multimodal noise
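
The weighted F1 scores quoted above can be read as a support-weighted average of per-class F1 values. The sketch below assumes that standard definition; the page does not state which implementation the authors used, and the function name `weighted_f1` and the toy labels are illustrative only.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 (assumed standard definition)."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label in sorted(support):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        # Each class contributes in proportion to its share of true labels.
        score += f1 * support[label] / total
    return score


# Toy three-class example (hypothetical labels, not data from the paper).
y_true = ["pos", "pos", "pos", "neu", "neg", "neg"]
y_pred = ["pos", "pos", "neu", "neu", "neg", "pos"]
toy_score = weighted_f1(y_true, y_pred)
```

Because each class is weighted by its support, a frequent class such as positive posts influences the score more than a rare one, which is worth keeping in mind when comparing the clean subset with the broader reconstructed setting.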

Abstract

Multimodal sentiment analysis aims to infer sentiment polarity by jointly modeling textual and visual information. Despite recent advances in pretrained language and vision encoders, sentiment prediction from social media posts remains challenging because textual and visual modalities are often weakly aligned, semantically noisy, and unevenly informative. Recent studies have emphasized the importance of fine-grained cross-modal fusion, stronger pretrained visual representations, and strategies for reducing modality bias in MVSA-style benchmarks. In this work, we present a systematic implementation-driven study of multimodal sentiment classification on MVSA-Single. We first construct a clean three-class sentiment-consistent subset and then implement a wide set of baselines, including text-only DistilBERT, image-only ResNet18, simple multimodal fusion, gated fusion, residual fusion, multi-task contrastive fusion, DINOv2-based fusion, and attention bottleneck fusion. Building on these experiments, we propose a semantic cross-modal fusion architecture that combines a RoBERTa text encoder with a CLIP vision encoder through cross-attention, allowing textual representations to selectively attend to sentiment-relevant visual signals. On the clean 2592-sample subset, the proposed model achieved the best overall performance, reaching 82.63% validation accuracy, 79.62% test accuracy and 79.42 weighted F1, outperforming all other implemented baselines under the same experimental pipeline and dataset setting. To improve comparability with prior MVSA-Single studies, we additionally reconstructed a broader processed setting from the 4511-sample HDF5 version and aligned 4318 text–image pairs with original image files. On this harder protocol-matched setting, the same model achieved 72.69% test accuracy and 70.66 weighted F1, revealing a substantial performance gap caused by dataset construction and residual multimodal noise. 
These findings show that strong cross-modal semantic alignment contributes more to robust multimodal sentiment prediction than simply increasing architectural complexity and that CLIP-based visual semantics are more beneficial than DINOv2 in our text–image sentiment setting.
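
The fusion idea described in the abstract, text representations attending to visual features through cross-attention, can be illustrated with a minimal single-head NumPy sketch. It is not the authors' implementation: the residual connection is inferred from the "text-anchored residual" wording in the title, random arrays stand in for RoBERTa token states and CLIP patch embeddings, and the names `text_anchored_cross_attention`, `w_q`, `w_k`, and `w_v` are hypothetical. Multi-head attention, layer normalization, pooling, and the classifier head are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def text_anchored_cross_attention(text, image, w_q, w_k, w_v):
    """Single-head cross-attention with a residual on the text stream.

    text:  (T, d) token features (stand-in for RoBERTa outputs)
    image: (P, d) patch features (stand-in for CLIP outputs)
    Returns the fused (T, d) features and the (T, P) attention weights.
    """
    queries = text @ w_q   # text asks the questions
    keys = image @ w_k     # image patches are what gets searched
    values = image @ w_v   # image content that gets mixed in
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    attn = softmax(scores, axis=-1)  # each text token distributes over patches
    fused = text + attn @ values     # residual keeps text as the anchor
    return fused, attn


# Random stand-in features: 5 text tokens, 3 image patches, width 8.
rng = np.random.default_rng(0)
d = 8
text = rng.normal(size=(5, d))
image = rng.normal(size=(3, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
fused, attn = text_anchored_cross_attention(text, image, w_q, w_k, w_v)
```

With this layout, an uninformative image can only add a correction on top of the text features rather than replace them, which is one plausible reading of why a text-anchored design suits posts where the two modalities are weakly aligned.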


Cite This Study

Natarajan et al. studied this question.

synapsesocial.com/papers/69fbefef164b5133a91a4188 https://doi.org/10.3390/app16094514
