ABSTRACT This study introduces GVSABench, a comprehensive Geo‐Visuospatial Ability (GVSA) benchmark for evaluating the spatial abilities of multimodal large language models (MLLMs). The benchmark systematically spans intrinsic and extrinsic, static and dynamic, and geographic and non‐geographic dimensions, comprising 851 image‐based tasks. The tasks cover spatial visualization, spatial relation reasoning, scene interpretation, spatial orientation and localization, and map‐based problem‐solving. Seven state‐of‐the‐art MLLMs were tested under zero‐shot, zero‐CoT, and one‐shot prompting strategies. Results indicate that overall accuracies remain moderate to low, with significant variability across models, languages, and task types. Prompting strategies yield only limited improvements, underscoring that prompt engineering alone cannot compensate for fundamental deficits in spatial cognition. Moreover, a scale‐separation effect was observed, with distinct performance patterns between geographic and non‐geographic tasks, as well as between small‐ and large‐scale contexts. These findings reveal that current MLLMs integrate visual, linguistic, and spatial reasoning only incompletely. GVSABench offers a reproducible and cognitively grounded framework for advancing future research on robust and human‐aligned spatial intelligence.
Can Liu
Zhiwei Wei
Hua Liao
Transactions in GIS
Beijing Normal University
Hunan Normal University
Liu et al. studied this question.
www.synapsesocial.com/papers/698d6e5a5be6419ac0d54087 — DOI: https://doi.org/10.1111/tgis.70189