August 7, 2025

C-MMCoT: Multimodal Chain-of-Thought Reasoning Using CLIP Features

Key Points

C-MMCoT enhances multimodal chain-of-thought reasoning by integrating CLIP-extracted visual features.
Experiments on the ScienceQA test set show that C-MMCoT outperforms baseline models in key accuracy metrics.
The two-stage architecture separates rationale generation from answer inference for better reasoning.
C-MMCoT's overall accuracy on the ScienceQA test set is 0.57 percentage points higher than that of GPT-4.

Abstract

Chain-of-Thought (CoT) reasoning enhances the performance of large language models (LLMs) on complex tasks such as mathematical problem solving, logical inference, and question answering by guiding models to generate intermediate reasoning steps rather than directly producing final answers. This approach simulates human-like, step-by-step thinking, significantly improving the stability and accuracy of the reasoning process. By moving beyond the "black box" nature of traditional LLM outputs, CoT also lays the foundation for more controllable and multimodal reasoning. However, most existing research has focused on unimodal (text-only) CoT, leaving the multimodal setting underexplored. Multimodal CoT (MMCoT) addresses this gap by separating rationale generation and answer inference through a two-stage architecture that integrates visual and textual inputs. Its performance remains suboptimal, though, because the visual features extracted by the Vision Transformer (ViT) have limited semantic richness. In this work, we propose C-MMCoT, a model that leverages CLIP-extracted visual features to generate rationales, thereby enhancing the semantic alignment of visual reasoning. Experiments on the ScienceQA test set demonstrate that C-MMCoT outperforms baseline models. Compared to GPT-4, it achieves higher accuracy on key categories such as SOC, TXT, and IMG, and its overall accuracy is 0.57 percentage points higher.
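
The two-stage flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `TwoStagePipeline` and its three callables are hypothetical placeholders, and feeding the generated rationale into the second stage as appended text is an assumption about how the stages are connected. In a real system, `encode_image` would wrap a CLIP image encoder and the other two callables would wrap trained language models.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Features = Sequence[float]  # placeholder type for visual features


@dataclass
class TwoStagePipeline:
    """Hypothetical sketch of rationale generation followed by answer inference."""

    encode_image: Callable[[bytes], Features]           # stand-in for a CLIP image encoder
    generate_rationale: Callable[[str, Features], str]  # stage 1 model
    infer_answer: Callable[[str, Features], str]        # stage 2 model

    def run(self, question: str, image: bytes) -> Tuple[str, str]:
        # Visual features are extracted once and shared by both stages.
        features = self.encode_image(image)

        # Stage 1: produce a rationale from the question and the visual features.
        rationale = self.generate_rationale(question, features)

        # Stage 2 (assumed wiring): append the rationale to the question
        # and infer the final answer from the combined input.
        prompt = f"{question}\nRationale: {rationale}"
        answer = self.infer_answer(prompt, features)
        return rationale, answer
```

Keeping the stages as separate callables mirrors the separation of rationale generation from answer inference, and it allows the visual encoder to be swapped (for example, ViT for CLIP) without touching either stage.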
