What question did this study set out to answer?

The study aims to enhance visual grounding in large vision language models during reasoning processes.

June 7, 2026

Progressive Visual Rationale for Multimodal Chain-of-Thought in Large Vision-Language Models

Key Points

The authors propose PVR-CoT, a training-free framework for multimodal chain-of-thought reasoning.
Visual attention boosting amplifies the attention weights on image tokens during text generation.
Visual token elimination progressively filters out irrelevant visual tokens at each reasoning step.
On the M³CoT benchmark with LLaVA-1.5, PVR-CoT improves reasoning accuracy by 1.46% for the 7B model and 12.86% for the 13B model compared with existing baselines.
The method strengthens image-grounded reasoning while maintaining computational efficiency.
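
To make the attention-boosting idea concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the summary does not give the boosting formula, so this version simply adds log(alpha) to the pre-softmax scores of image tokens, which multiplies their unnormalized weight by alpha. The function name `boost_visual_attention` and the parameter `alpha` are hypothetical.

```python
import numpy as np

def boost_visual_attention(scores, image_mask, alpha=1.5):
    """Return attention weights with image-token positions up-weighted.

    scores:     1-D array of pre-softmax attention scores for one query.
    image_mask: boolean array, True where the key is an image token.
    alpha:      hypothetical boost factor (> 1 amplifies image tokens).
    """
    scores = np.asarray(scores, dtype=float)
    image_mask = np.asarray(image_mask, dtype=bool)
    # Adding log(alpha) before the softmax multiplies each image token's
    # unnormalized weight by alpha; normalizing keeps the weights summing to 1.
    boosted = scores + np.where(image_mask, np.log(alpha), 0.0)
    boosted = boosted - boosted.max()  # numerical stability
    weights = np.exp(boosted)
    return weights / weights.sum()

# Toy example: two image tokens followed by two text tokens.
scores = np.array([0.2, 0.1, 0.4, 0.3])
image_mask = np.array([True, True, False, False])
plain = boost_visual_attention(scores, image_mask, alpha=1.0)
boosted = boost_visual_attention(scores, image_mask, alpha=2.0)
print(plain[image_mask].sum(), boosted[image_mask].sum())
```

With `alpha=1.0` the function reduces to an ordinary softmax, so comparing the two printed values shows how much attention mass the boost shifts toward the image tokens.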

Abstract

Large vision language models (LVLMs) can solve complex tasks through multimodal chain-of-thought (CoT) reasoning. However, existing CoT approaches often incur substantial computational costs and suffer from hallucinations, caused by a diminishing focus on visual data during text generation. To address these limitations, we propose a progressive visual rationale for multimodal CoT (PVR-CoT), a training-free framework designed to strengthen visual grounding during the reasoning process. PVR-CoT introduces two key mechanisms: (1) visual attention boosting, which dynamically amplifies attention weights assigned to image tokens to prevent visual information decay, and (2) visual token elimination, which progressively filters out irrelevant visual tokens to reduce noise at each reasoning step. We evaluated the proposed method on the M³CoT benchmark using the LLaVA-1.5 architecture. Experimental results demonstrate that PVR-CoT significantly improves reasoning performance, achieving accuracy gains of 1.46% and 12.86% for the 7 B and 13 B models, respectively, compared with existing baselines. These findings demonstrate that PVR-CoT effectively enhances image-grounded reasoning while maintaining efficiency, offering a practical solution for improving multimodal CoT without additional training.
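
The progressive token elimination described in the abstract can be sketched as follows. This is a hedged illustration only: the abstract does not say how relevance is scored or how many tokens are dropped per step, so the sketch assumes a per-token relevance score (for example, attention received) and a fixed, hypothetical `keep_ratio`. The random scores are a stand-in for real model attention.

```python
import numpy as np

def eliminate_visual_tokens(active_idx, relevance, keep_ratio=0.5):
    """Keep the most relevant fraction of the currently active image tokens.

    active_idx: 1-D int array of image-token indices still in use.
    relevance:  1-D array with one relevance score per image token.
    keep_ratio: hypothetical fraction of active tokens kept per step.
    """
    active_idx = np.asarray(active_idx)
    relevance = np.asarray(relevance, dtype=float)
    n_keep = max(1, int(np.ceil(len(active_idx) * keep_ratio)))
    # Rank the active tokens by descending relevance and keep the top n_keep.
    order = np.argsort(-relevance[active_idx], kind="stable")
    return np.sort(active_idx[order[:n_keep]])

rng = np.random.default_rng(0)
num_image_tokens = 16
active = np.arange(num_image_tokens)
sizes = [len(active)]
for step in range(3):
    # Stand-in for the attention each image token receives at this step.
    relevance = rng.random(num_image_tokens)
    active = eliminate_visual_tokens(active, relevance, keep_ratio=0.5)
    sizes.append(len(active))
print(sizes)  # [16, 8, 4, 2]
```

Because each step only selects from the tokens that survived the previous step, the visual context shrinks monotonically across the reasoning chain, which is the sense in which the filtering is progressive.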

Cite This Study

Lim et al. (2026) studied this question.

synapsesocial.com/papers/6a250a3c7def13d035e1a64e https://doi.org/10.15323/techart.2026.5.13.2.20
