What question did this study set out to answer?

This work aims to improve fake news detection by integrating textual and visual features through advanced machine learning techniques.

July 3, 2026 | Open Access

Multimodal social media fake news detection using RoBERTa and vision transformer encoders with reliability aware adaptive fusion

Key Points

Introduced RoViT-Detect, a cross-modal framework utilizing RoBERTa and ViT-based encoders.
Integrated late fusion and an MLP classifier for enhanced multimodal detection.
Incorporated adaptive fusion to mitigate modality dominance during training.
Achieved 98.93% accuracy on the Weibo dataset, 99.69% on Twitter, and 85.14% on Fakeddit.
RoViT-Detect consistently outperformed state-of-the-art multimodal methods across all datasets.

Abstract

The rapid rise of multimodal content on social media has made fake news detection technically more complex. Deceptive posts frequently pair seemingly credible textual narratives with contextually irrelevant or manipulated images, posing significant societal risks and highlighting the need for robust, automated detection mechanisms. While recent multimodal fake news detection models such as EANN, SpotFake, MCAN, LIIMR, MFND-CMM, PFBL, and TMEF-BI have shown notable improvements, many existing approaches still rely on shallow fusion strategies, suffer from modality dominance, or assume equal reliability of textual and visual information across all samples. This work introduces RoViT-Detect, a cross-modal framework that jointly models textual and visual features using RoBERTa and ViT-based encoders, integrating complementary cues through late fusion and an MLP classifier for effective multimodal fake news detection. The late-fusion design is based on the strong backbone hypothesis: when unimodal encoders are sufficiently pre-trained, globally aligned CLS representations can reduce the need for complex cross-modal attention. Furthermore, motivated by reliability-aware and uncertainty-aware fusion frameworks such as PFBL and TMEF-BI, an adaptive fusion strategy is incorporated to mitigate the influence of less informative modalities and reduce modality dominance during training. Experiments conducted on three widely used benchmark datasets, Twitter (English), Weibo (Chinese), and the large-scale Fakeddit dataset, which together cover noisy, heterogeneous, and event-driven social media environments, demonstrate that the proposed approach consistently outperforms state-of-the-art multimodal methods. The RoViT-Detect model achieves 98.93% accuracy on the Weibo dataset, 99.69% accuracy on the Twitter dataset, and 85.14% accuracy on the Fakeddit dataset.
These results confirm that explicitly modeling cross-modal interactions and modality reliability leads to more robust and reliable fake news detection in dynamic social media environments, establishing RoViT-Detect as a strong and scalable framework for real-world social media applications.
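
As a rough illustration of the idea described in the abstract, the following minimal sketch shows one way that late fusion with a reliability-aware gate could be wired together. It is not the authors' implementation: the class name `AdaptiveLateFusion`, the softmax gate over two per-modality scores, the layer sizes, and the untrained random weights are all assumptions made for illustration, and the short input vectors stand in for the CLS embeddings that RoBERTa and ViT encoders would produce.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def linear(x, weights, bias):
    """Dense layer: weights is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_linear(in_dim, out_dim, rng):
    """Random (untrained) parameters, for illustration only."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


class AdaptiveLateFusion:
    """Hypothetical late-fusion head with a per-sample reliability gate."""

    def __init__(self, text_dim, image_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        # One scalar reliability score per modality (assumed design).
        self.text_gate = init_linear(text_dim, 1, rng)
        self.image_gate = init_linear(image_dim, 1, rng)
        # MLP classifier over the concatenated, re-weighted features.
        self.hidden = init_linear(text_dim + image_dim, hidden_dim, rng)
        self.out = init_linear(hidden_dim, 2, rng)  # classes: real, fake

    def reliability(self, text_cls, image_cls):
        """Return (text_weight, image_weight); the two weights sum to 1."""
        t = linear(text_cls, *self.text_gate)[0]
        i = linear(image_cls, *self.image_gate)[0]
        return tuple(softmax([t, i]))

    def forward(self, text_cls, image_cls):
        w_text, w_image = self.reliability(text_cls, image_cls)
        # Late fusion: scale each unimodal embedding, then concatenate.
        fused = ([w_text * v for v in text_cls]
                 + [w_image * v for v in image_cls])
        h = relu(linear(fused, *self.hidden))
        probs = softmax(linear(h, *self.out))
        return probs, (w_text, w_image)


if __name__ == "__main__":
    # Stand-in vectors; real CLS embeddings would come from the encoders.
    text_cls = [0.1, 0.2, 0.3, 0.4]
    image_cls = [0.5, -0.1, 0.0, 0.2]
    model = AdaptiveLateFusion(text_dim=4, image_dim=4, hidden_dim=8)
    probs, weights = model.forward(text_cls, image_cls)
    print("class probabilities:", probs)
    print("modality weights (text, image):", weights)
```

In this sketch the gate produces one weight per modality for each sample, so a weak or uninformative image embedding can be down-weighted before classification instead of being treated as equally reliable as the text. In a trained system the gate and classifier parameters would be learned jointly with, or on top of, the pre-trained encoders.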
