Although multimodal fake news detection methods have made progress in aligning multimodal semantics, they still struggle to analyze background context, emotional tone, and the overall plausibility of news content. To address these limitations, we propose a novel human-like collaborative framework for multimodal fake news detection that integrates large and small models. Specifically, we use large vision-language models (LVLMs) to perform deep semantic analysis and reflective summarization of news cues. By leveraging the contextual understanding, knowledge recall, and logical reasoning capabilities of large models, the proposed approach improves the accuracy and reliability of fake news detection. The framework comprises three key components: 1) a chain-of-thought (CoT) prompting strategy that guides the LVLM to analyze news content by evaluating image credibility, identifying potential tampering, extracting linguistic styles, detecting emotional tones, uncovering logical connections within the text, and verifying factual accuracy; 2) a reflection and summarization step that independently condenses the lengthy analytical outputs for the image and text modalities to reduce redundancy, after which the resulting summaries are encoded into compact representations by pretrained text encoders and integrated with the original multimodal features; and 3) a progressive fusion mechanism that enables collaboration between large and small models, so that deeply fused features can be used effectively at the surface level. Extensive experiments on three benchmark multimodal fake news datasets demonstrate the effectiveness and robustness of the proposed method, which consistently outperforms state-of-the-art baselines in multimodal fake news detection. The code is available at https://github.com/xxx.
Wang et al. studied this question.
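To make the first component concrete, the following is a minimal sketch of how the analysis steps named in the abstract might be assembled into step-by-step prompts. The function name `build_cot_prompt`, the prompt wording, and the split of steps between image and text are illustrative assumptions, not the authors' actual prompts. No LVLM is called here.

```python
# Analysis steps taken from the abstract. Grouping them by modality is an
# assumption made for illustration only.
IMAGE_STEPS = [
    "Evaluate the credibility of the image.",
    "Identify any signs of potential tampering.",
]
TEXT_STEPS = [
    "Describe the linguistic style of the text.",
    "Detect the emotional tone.",
    "Trace the logical connections within the text.",
    "Verify the factual accuracy of the claims.",
]


def build_cot_prompt(modality: str, steps: list[str]) -> str:
    """Assemble a numbered, step-by-step prompt (hypothetical wording)."""
    lines = [f"Analyze the news {modality} step by step."]
    lines += [f"Step {i}: {step}" for i, step in enumerate(steps, start=1)]
    return "\n".join(lines)


image_prompt = build_cot_prompt("image", IMAGE_STEPS)
text_prompt = build_cot_prompt("text", TEXT_STEPS)
print(text_prompt)
```

Keeping one prompt per modality mirrors the abstract's statement that the image and text analyses are summarized independently. How the resulting outputs are then encoded and fused is not specified in enough detail here to sketch.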