The rapid rise of multimodal content on social media has increased the technological complexity of detecting fake news. Deceptive posts frequently pair seemingly credible textual narratives with contextually irrelevant or manipulated images, posing significant societal risks and highlighting the need for robust, automated detection mechanisms. While recent multimodal fake news detection models such as EANN, SpotFake, MCAN, LIIMR, MFND-CMM, PFBL, and TMEF-BI have shown notable improvements, many existing approaches still rely on shallow fusion strategies, suffer from modality dominance, or assume that textual and visual information are equally reliable across all samples. This work introduces RoViT-Detect, a cross-modal framework that jointly models textual and visual features using RoBERTa- and ViT-based encoders, integrating complementary cues through late fusion and an MLP classifier for effective multimodal fake news detection. The late-fusion design is based on the strong backbone hypothesis: when unimodal encoders are sufficiently pre-trained, globally aligned CLS representations can reduce the need for complex cross-modal attention. Furthermore, motivated by reliability-aware and uncertainty-aware fusion frameworks such as PFBL and TMEF-BI, an adaptive fusion strategy is incorporated to mitigate the influence of less informative modalities and reduce modality dominance during training. Experiments on three widely used benchmark datasets (Twitter in English, Weibo in Chinese, and the large-scale Fakeddit dataset), which cover noisy, heterogeneous, and event-driven social media environments, demonstrate that the proposed approach consistently outperforms state-of-the-art multimodal methods. The RoViT-Detect model achieves 98.93% accuracy on the Weibo dataset, 99.69% accuracy on the Twitter dataset, and 85.14% accuracy on the Fakeddit dataset.
These results confirm that explicitly modeling cross-modal interactions and modality reliability leads to more robust and reliable fake news detection in dynamic social media environments, establishing RoViT-Detect as a strong and scalable framework for real-world social media applications.
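The late-fusion idea described above can be illustrated with a minimal sketch. The abstract does not specify the form of the adaptive fusion strategy, so the softmax gate over the two modalities used here is an assumption, as are the class name `GatedLateFusion`, the helper functions, the layer sizes, and the random vectors that stand in for RoBERTa and ViT CLS embeddings. The sketch also assumes both embeddings already share a common dimension. The weights are untrained, so the output only shows the data flow, not the reported accuracy.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Dense layer: `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases (untrained, illustration only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class GatedLateFusion:
    """Hypothetical gated late fusion followed by a one-hidden-layer MLP."""

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        # Gate: reads both CLS vectors, emits one score per modality (assumed design).
        self.gate = init_layer(2, 2 * dim, rng)
        # MLP classifier over the fused vector: hidden layer, then 2 classes.
        self.hidden = init_layer(hidden, dim, rng)
        self.out = init_layer(2, hidden, rng)

    def forward(self, text_cls, image_cls):
        # Per-sample modality weights that sum to 1.
        gate_w = softmax(linear(*self.gate, text_cls + image_cls))
        # Weighted sum of the two unimodal representations (late fusion).
        fused = [gate_w[0] * t + gate_w[1] * v for t, v in zip(text_cls, image_cls)]
        # ReLU hidden layer, then class probabilities over two labels.
        h = [max(0.0, z) for z in linear(*self.hidden, fused)]
        probs = softmax(linear(*self.out, h))
        return probs, gate_w


# Stand-in CLS embeddings; in practice these would come from the two encoders.
rng = random.Random(1)
text_cls = [rng.gauss(0, 1) for _ in range(8)]
image_cls = [rng.gauss(0, 1) for _ in range(8)]

model = GatedLateFusion(dim=8, hidden=4)
probs, gate_w = model.forward(text_cls, image_cls)
print("class probabilities:", probs)
print("modality weights (text, image):", gate_w)
```

Because the gate weights are computed per sample, a post with an uninformative image can in principle receive a lower image weight than a post whose image carries the deceptive cue, which is one simple way to limit modality dominance.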
Bhukya et al. (Tue,) studied this question.