Large vision-language models (LVLMs) can solve complex tasks through multimodal chain-of-thought (CoT) reasoning. However, existing CoT approaches often incur substantial computational costs and suffer from hallucinations caused by diminishing attention to visual input during text generation. To address these limitations, we propose a progressive visual rationale for multimodal CoT (PVR-CoT), a training-free framework designed to strengthen visual grounding during the reasoning process. PVR-CoT introduces two key mechanisms: (1) visual attention boosting, which dynamically amplifies the attention weights assigned to image tokens to prevent visual information decay, and (2) visual token elimination, which progressively filters out irrelevant visual tokens to reduce noise at each reasoning step. We evaluated the proposed method on the M³CoT benchmark using the LLaVA-1.5 architecture. Compared with existing baselines, PVR-CoT improves reasoning accuracy by 1.46% and 12.86% for the 7B and 13B models, respectively. These findings indicate that PVR-CoT enhances image-grounded reasoning while maintaining efficiency, offering a practical way to improve multimodal CoT without additional training.
Lim et al. (Sun,) studied this question.
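
The two mechanisms named in the abstract can be illustrated with a minimal sketch. The abstract gives no formulas, so everything below is an assumption made for illustration rather than the PVR-CoT implementation: the function names, the scaling factor `alpha`, the `keep_ratio` parameter, the renormalization step, and the use of attention magnitude as the relevance criterion are all hypothetical. The sketch operates on a single attention distribution over a token sequence, with a Boolean flag marking which positions are image tokens.

```python
import math


def boost_visual_attention(attn, is_visual, alpha=1.5):
    """Scale attention on image tokens by alpha, then renormalize.

    Assumption: boosting is a multiplicative rescaling followed by
    renormalization so the weights still sum to 1. Requires that the
    attention weights are non-negative and not all zero.
    """
    scaled = [a * alpha if v else a for a, v in zip(attn, is_visual)]
    total = sum(scaled)
    return [s / total for s in scaled]


def eliminate_visual_tokens(attn, is_visual, keep_ratio=0.5):
    """Return an updated image-token mask keeping only the top-attended ones.

    Assumption: relevance is approximated by attention weight, and a fixed
    fraction of the remaining image tokens is kept at each step.
    """
    visual_idx = [i for i, v in enumerate(is_visual) if v]
    n_keep = max(1, math.ceil(keep_ratio * len(visual_idx)))
    # Rank image-token positions by attention weight, highest first.
    ranked = sorted(visual_idx, key=lambda i: attn[i], reverse=True)
    kept = set(ranked[:n_keep])
    return [i in kept for i in range(len(is_visual))]


def progressive_steps(attn, is_visual, n_steps=2, alpha=1.5, keep_ratio=0.5):
    """Apply elimination and boosting once per (hypothetical) reasoning step."""
    mask = list(is_visual)
    for _ in range(n_steps):
        mask = eliminate_visual_tokens(attn, mask, keep_ratio)
        attn = boost_visual_attention(attn, mask, alpha)
    return attn, mask
```

In this toy setting, each step shrinks the set of active image tokens and shifts attention mass toward the ones that remain, which is the qualitative behavior the abstract describes. In an actual LVLM the same operations would apply per attention head and per layer, and the choice of which layers to modify is not specified here.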