What question did this study set out to answer?

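This research aims to close the visual grounding gap in chart reasoning: current Vision-Language Models reason about charts abstractly, without anchoring each step to precise locations in the image. The study proposes active perception as the remedy.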

May 17, 2026 · Open Access

Active perception: Gaze-guided thinking for chart understanding

Key Points

Introduced Active Perception, a framework that enables Gaze-Guided Thinking by anchoring chart reasoning to concrete visual locations.
Proposed Skill Cultivation, a two-stage training strategy that combines supervised fine-tuning on ChartQAGaze-14K with reinforcement learning.
Evaluated the approach on the ChartQA dataset, focusing on coordinate-based operations (Locate, Trace, Extract, Compare).
Active Perception achieved an overall accuracy of 82.44% on ChartQA, 3.48 percentage points above the 78.96% baseline.
Accuracy on the challenging Human split rose from 75.76% to 81.28%, a gain of 5.52 percentage points.
Qualitative analysis showed models mimicking human chart-reading behaviors by effectively leveraging visual coordinates.
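
The reported gains are absolute differences in accuracy (percentage points), which is easy to confuse with a relative improvement. The short sketch below recomputes both from the figures given in the abstract; the dictionary layout and the `gains` helper are illustrative names, not part of the paper.

```python
# Accuracy figures (percent) as reported in the abstract.
reported = {
    "overall": {"baseline": 78.96, "active_perception": 82.44},
    "human": {"baseline": 75.76, "active_perception": 81.28},
}


def gains(baseline: float, improved: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = improved - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative, 2)


# Map each split to its (absolute, relative) gain.
results = {
    split: gains(v["baseline"], v["active_perception"])
    for split, v in reported.items()
}

for split, (absolute, relative) in results.items():
    print(f"{split}: +{absolute} points absolute, +{relative}% relative")
```

The absolute gains match the 3.48 and 5.52 points quoted above; the corresponding relative gains come out at roughly 4.4% and 7.3%.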

Abstract

Answering questions about charts presents a unique challenge for Vision-Language Models (VLMs). Unlike natural images, charts are structured artifacts governed by explicit visual grammar that demands pixel-level accuracy in visual perception. While recent VLMs demonstrate impressive reasoning abilities on chart tasks, a critical gap remains: their reasoning operates abstractly, disconnected from precise visual grounding. We introduce Active Perception, a framework that enables Gaze-Guided Thinking, a reasoning pattern that explicitly anchors abstract inference to concrete visual locations through coordinate-based operations (Locate, Trace, Extract, Compare). To instill this capability, we propose Skill Cultivation, a two-stage training strategy: Stage I injects coordinate-aware primitives via Supervised Fine-Tuning on ChartQAGaze-14K, our synthesized dataset of 14K coordinate-annotated reasoning chains; Stage II internalizes these skills into adaptive strategies via Reinforcement Learning with outcome-based rewards. Building upon Qwen2.5-VL-7B, Active Perception achieves state-of-the-art performance on ChartQA, improving overall accuracy from 78.96% to 82.44%, with particularly notable gains on the challenging Human split (75.76% to 81.28%). Qualitative analysis reveals emergent systematic chart-reading behaviors that mirror human visual strategies, demonstrating the effectiveness of spatially grounded reasoning for structured visual understanding.

• We identify a critical gap in current VLMs for chart understanding: reasoning operates abstractly without precise visual grounding, limiting accurate data extraction from structured visualizations.
• We propose Gaze-Guided Thinking, a reasoning pattern that anchors abstract inference to concrete visual locations through Coordinate Primitives (Locate, Trace, Extract, Compare), mimicking human chart scanning behavior.
• We introduce Skill Cultivation, a two-stage training strategy combining SFT on ChartQAGaze-14K (14K coordinate-annotated reasoning chains) with outcome-based RL to inject and internalize spatially-grounded reasoning.
• Active Perception achieves 82.44% overall accuracy on ChartQA (a 3.48% absolute improvement), with particularly strong performance on the Human split (81.28%, +5.52%), demonstrating that explicit visual grounding substantially enhances structured visual understanding.
• Qualitative analysis reveals emergent human-like chart reading behaviors, where models systematically leverage coordinates for precise value extraction and spatial reasoning.
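
To make the four Coordinate Primitives more concrete, here is a toy sketch of how Locate, Trace, Extract, and Compare might compose on a bar chart. Everything in it is an assumption made for illustration: the abstract does not specify how the primitives are implemented, and the `Point` and `Axis` types, the linear axis calibration, and the pixel values are all hypothetical. In the paper the model emits such steps inside its reasoning chain rather than calling functions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    x: float  # pixel column
    y: float  # pixel row (grows downward, as in image coordinates)


@dataclass(frozen=True)
class Axis:
    """Hypothetical linear y-axis calibrated by two labelled ticks."""
    pixel_a: float
    value_a: float
    pixel_b: float
    value_b: float
    x: float  # pixel column where the axis line is drawn


def locate(marks: dict[str, Point], label: str) -> Point:
    """Locate: return the pixel position of a named mark (e.g. a bar top)."""
    return marks[label]


def trace(point: Point, axis: Axis) -> Point:
    """Trace: move horizontally from the mark to the y-axis."""
    return Point(axis.x, point.y)


def extract(point: Point, axis: Axis) -> float:
    """Extract: linearly interpolate the data value at a pixel row."""
    slope = (axis.value_b - axis.value_a) / (axis.pixel_b - axis.pixel_a)
    return axis.value_a + slope * (point.y - axis.pixel_a)


def compare(a: float, b: float) -> float:
    """Compare: signed difference between two extracted values."""
    return a - b


# Toy chart (assumed): value 0 at pixel row 400, value 100 at pixel row 0.
axis = Axis(pixel_a=400.0, value_a=0.0, pixel_b=0.0, value_b=100.0, x=50.0)
marks = {"2019": Point(120.0, 100.0), "2020": Point(200.0, 160.0)}

v2019 = extract(trace(locate(marks, "2019"), axis), axis)
v2020 = extract(trace(locate(marks, "2020"), axis), axis)
difference = compare(v2019, v2020)
print(v2019, v2020, difference)  # 75.0 60.0 15.0
```

The point of the sketch is that each reasoning step carries explicit coordinates, so the final comparison can be checked against specific positions in the image instead of an unanchored estimate.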
