Answering questions about charts presents a unique challenge for Vision-Language Models (VLMs). Unlike natural images, charts are structured artifacts governed by an explicit visual grammar that demands pixel-level accuracy in visual perception. While recent VLMs demonstrate impressive reasoning abilities on chart tasks, a critical gap remains: their reasoning operates abstractly, disconnected from precise visual grounding. We introduce Active Perception, a framework that enables Gaze-Guided Thinking, a reasoning pattern that explicitly anchors abstract inference to concrete visual locations through coordinate-based operations (Locate, Trace, Extract, Compare). To instill this capability, we propose Skill Cultivation, a two-stage training strategy: Stage I injects coordinate-aware primitives via Supervised Fine-Tuning (SFT) on ChartQAGaze-14K, our synthesized dataset of 14K coordinate-annotated reasoning chains; Stage II internalizes these skills into adaptive strategies via Reinforcement Learning (RL) with outcome-based rewards. Building upon Qwen2.5-VL-7B, Active Perception achieves state-of-the-art performance on ChartQA, improving overall accuracy from 78.96% to 82.44%, with particularly notable gains on the challenging Human split (75.76% to 81.28%). Qualitative analysis reveals emergent systematic chart-reading behaviors that mirror human visual strategies, demonstrating the effectiveness of spatially grounded reasoning for structured visual understanding.

• We identify a critical gap in current VLMs for chart understanding: reasoning operates abstractly without precise visual grounding, limiting accurate data extraction from structured visualizations.
• We propose Gaze-Guided Thinking, a reasoning pattern that anchors abstract inference to concrete visual locations through Coordinate Primitives (Locate, Trace, Extract, Compare), mimicking human chart-scanning behavior.
• We introduce Skill Cultivation, a two-stage training strategy combining SFT on ChartQAGaze-14K (14K coordinate-annotated reasoning chains) with outcome-based RL to inject and internalize spatially grounded reasoning.
• Active Perception achieves 82.44% overall accuracy on ChartQA (up 3.48 percentage points from 78.96%), with particularly strong performance on the Human split (81.28%, up 5.52 percentage points from 75.76%), demonstrating that explicit visual grounding substantially enhances structured visual understanding.
• Qualitative analysis reveals emergent human-like chart-reading behaviors, where models systematically leverage coordinates for precise value extraction and spatial reasoning.
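
To make the Coordinate Primitives named above more concrete, the following is a minimal sketch of how Locate, Trace, Extract, and Compare could chain together on a toy bar chart. It is an illustration under stated assumptions, not the authors' implementation: the `Point` and `Axis` types, the function signatures, the pixel positions, and the linear axis calibration are all hypothetical. The `outcome_reward` function is likewise only one plausible form of an outcome-based reward; its 5% tolerance is borrowed from ChartQA's relaxed-accuracy metric and is an assumption about Stage II, which the text above does not specify.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Point:
    x: float  # pixel column
    y: float  # pixel row (grows downward, as in image coordinates)


@dataclass(frozen=True)
class Axis:
    """Hypothetical linear y-axis calibration from two known tick marks."""
    pixel_a: float
    value_a: float
    pixel_b: float
    value_b: float

    def to_value(self, pixel_y: float) -> float:
        # Linear interpolation between the two calibrated ticks.
        return self.value_a + (pixel_y - self.pixel_a) * (
            self.value_b - self.value_a
        ) / (self.pixel_b - self.pixel_a)


def locate(marks: Dict[str, Point], label: str) -> Point:
    """Locate: find the pixel position of the mark for a category label."""
    return marks[label]


def trace(point: Point, axis_x: float) -> Point:
    """Trace: project the mark horizontally onto the y-axis."""
    return Point(axis_x, point.y)


def extract(axis: Axis, on_axis: Point) -> float:
    """Extract: convert the traced pixel row into a data value."""
    return axis.to_value(on_axis.y)


def compare(a: float, b: float) -> float:
    """Compare: signed difference between two extracted values."""
    return a - b


def answer_difference(marks: Dict[str, Point], axis: Axis, axis_x: float,
                      label_a: str, label_b: str) -> float:
    """Chain the four primitives to answer 'how much larger is A than B?'."""
    value_a = extract(axis, trace(locate(marks, label_a), axis_x))
    value_b = extract(axis, trace(locate(marks, label_b), axis_x))
    return compare(value_a, value_b)


def outcome_reward(prediction: float, target: float,
                   tolerance: float = 0.05) -> float:
    """Assumed outcome-based reward: 1.0 if within a relative tolerance."""
    if target == 0:
        return 1.0 if prediction == 0 else 0.0
    return 1.0 if abs(prediction - target) / abs(target) <= tolerance else 0.0


# Toy chart (all numbers invented): the y-axis sits at x=50, the tick for
# value 0 is at pixel row 400, and the tick for value 100 is at pixel row 100.
TOY_AXIS = Axis(pixel_a=400.0, value_a=0.0, pixel_b=100.0, value_b=100.0)
TOY_MARKS = {"2019": Point(120.0, 250.0), "2020": Point(200.0, 160.0)}

if __name__ == "__main__":
    diff = answer_difference(TOY_MARKS, TOY_AXIS, 50.0, "2020", "2019")
    print(diff, outcome_reward(diff, 30.0))
```

In this sketch every intermediate step carries an explicit pixel coordinate, which is the property the abstract attributes to Gaze-Guided Thinking; the reward, by contrast, looks only at the final answer and ignores the intermediate coordinates.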
Huang et al. (Fri,) studied this question.