Remote sensing visual question answering (RS-VQA) is essential to intelligent Earth observation because it supports interactive querying of high-resolution aerial images. Many existing methods struggle with fine-grained geospatial reasoning because remote sensing (RS) scenes exhibit intrinsic multi-scale object variance and pronounced spatial heterogeneity. As a result, models tend to rely on linguistic priors rather than reasoning from visual evidence. In this paper, we present PMA-VQA, a progressive multi-scale feature fusion framework with spatially adaptive attention that grounds the RS-VQA task in spatially aware hierarchical feature integration. To achieve hierarchical, multi-level, language-informed integration, we propose a spatial attention aggregation module (SAAM) and a progressive feature fusion and classification module (PFCM). The SAAM employs spatially adaptive gating to align cross-modal features with semantic context, while the PFCM integrates multi-scale representations spanning high-level semantic abstractions and low-level spatial details. Experimental results on the RS-VQA LR and HR benchmarks show that PMA-VQA outperforms all competing methods in accuracy and robustness. Evaluation on HRVQA further confirms the effectiveness of the SAAM and PFCM across diverse RS scenes.
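To make the two ideas named above more concrete, the following is a minimal illustrative sketch of question-conditioned spatial gating followed by coarse-to-fine fusion of multi-scale features. It is not the SAAM or PFCM formulation itself, which the abstract does not specify: the sigmoid dot-product gate, the mean pooling, the equal-weight blend, and the names `spatial_gate`, `mean_pool`, and `progressive_fuse` are all assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def spatial_gate(features, question):
    """Scale each location's feature vector by a gate in (0, 1).

    Assumed gate: sigmoid of the scaled dot product between the
    location's visual vector and the question embedding.
    """
    d = len(question)
    gated = []
    for vec in features:
        score = sum(v * q for v, q in zip(vec, question)) / math.sqrt(d)
        g = sigmoid(score)
        gated.append([g * v for v in vec])
    return gated


def mean_pool(features):
    """Average the feature vectors over all spatial locations."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def progressive_fuse(scales, question):
    """Fuse scales in order (e.g. coarse to fine).

    Each scale is gated, pooled, then blended with equal weight
    into the result accumulated so far (an assumed fusion rule).
    """
    fused = None
    for features in scales:
        pooled = mean_pool(spatial_gate(features, question))
        if fused is None:
            fused = pooled
        else:
            fused = [0.5 * (f + p) for f, p in zip(fused, pooled)]
    return fused
```

In this sketch, each scale is a list of per-location feature vectors that share the question embedding's dimension; a trained model would instead learn the gate and fusion weights.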
He et al. studied this question.