Answering questions about charts presents a unique challenge for Vision-Language Models (VLMs). Unlike natural images, charts are structured artifacts governed by explicit visual grammar that demands pixel-level accuracy in visual perception. While recent VLMs demonstrate impressive reasoning abilities on chart tasks, a critical gap remains: their reasoning operates abstractly, disconnected from precise visual grounding. We introduce Active Perception, a framework that enables Gaze-Guided Thinking, a reasoning pattern that explicitly anchors abstract inference to concrete visual locations through coordinate-based operations (Locate, Trace, Extract, Compare). To instill this capability, we propose Skill Cultivation, a two-stage training strategy: Stage I injects coordinate-aware primitives via Supervised Fine-Tuning on ChartQAGaze-14K, our synthesized dataset of 14K coordinate-annotated reasoning chains; Stage II internalizes these skills into adaptive strategies via Reinforcement Learning with outcome-based rewards. Building upon Qwen2.5-VL-7B, Active Perception achieves state-of-the-art performance on ChartQA, improving overall accuracy from 78.96% to 82.44%, with particularly notable gains on the challenging Human split (75.76% to 81.28%). Qualitative analysis reveals emergent systematic chart-reading behaviors that mirror human visual strategies, demonstrating the effectiveness of spatially grounded reasoning for structured visual understanding.
• We identify a critical gap in current VLMs for chart understanding: reasoning operates abstractly without precise visual grounding, limiting accurate data extraction from structured visualizations.
• We propose Gaze-Guided Thinking, a reasoning pattern that anchors abstract inference to concrete visual locations through Coordinate Primitives (Locate, Trace, Extract, Compare), mimicking human chart scanning behavior.
• We introduce Skill Cultivation, a two-stage training strategy combining SFT on ChartQAGaze-14K (14K coordinate-annotated reasoning chains) with outcome-based RL to inject and internalize spatially-grounded reasoning.
• Active Perception achieves 82.44% overall accuracy on ChartQA (a 3.48% absolute improvement), with particularly strong performance on the Human split (81.28%, +5.52%), demonstrating that explicit visual grounding substantially enhances structured visual understanding.
• Qualitative analysis reveals emergent human-like chart reading behaviors, where models systematically leverage coordinates for precise value extraction and spatial reasoning.
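The abstract does not specify how the Stage II outcome-based reward is computed. As a minimal sketch, assuming the reward is a binary signal derived from the relaxed-accuracy criterion commonly used to score ChartQA answers (numeric answers accepted within a 5% relative tolerance, other answers matched exactly), it might look like the following. The function names `relaxed_match` and `outcome_reward` and the normalization choices are hypothetical and are not taken from the paper.

```python
from typing import Optional


def _to_float(text: str) -> Optional[float]:
    """Parse a numeric answer, ignoring thousands separators and a trailing '%'."""
    try:
        return float(text.strip().rstrip("%").replace(",", ""))
    except ValueError:
        return None


def relaxed_match(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """Assumed criterion: numeric answers may deviate by a relative tolerance;
    non-numeric answers must match exactly after case and whitespace normalization."""
    pred_value, target_value = _to_float(prediction), _to_float(target)
    if pred_value is not None and target_value is not None:
        if target_value == 0:
            # Relative error is undefined for a zero target, so require equality.
            return pred_value == 0
        return abs(pred_value - target_value) / abs(target_value) <= tolerance
    return prediction.strip().lower() == target.strip().lower()


def outcome_reward(prediction: str, target: str) -> float:
    """Hypothetical binary reward: 1.0 for a relaxed match, 0.0 otherwise."""
    return 1.0 if relaxed_match(prediction, target) else 0.0
```

A reward of this kind depends only on the final answer, so the intermediate coordinate operations are left unconstrained and the policy is free to decide when to use them.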
Xin Huang
Huang Yuanyuan
Yongliang Wang
Neurocomputing
Waseda University
Beijing University of Posts and Telecommunications
Shanghai Zhaozhan Metal Materials
Huang et al. studied this question.
www.synapsesocial.com/papers/6a095a877880e6d24efe07f0 — DOI: https://doi.org/10.1016/j.neucom.2026.133935