What question did this study set out to answer?

The research aims to improve visual reasoning in remote sensing visual question answering (RS-VQA) by employing progressive multi-scale feature fusion with spatially adaptive attention.

April 12, 2026 | Open Access

PMA-VQA: Progressive Multi-Scale Feature Fusion with Spatially Adaptive Attention for Remote Sensing Visual Question Answering

Key Points

The research aims to improve visual reasoning in remote sensing visual question answering (RS-VQA) by employing progressive multi-scale feature fusion with spatially adaptive attention.
Developed PMA-VQA integrating spatially adaptive attention with hierarchical feature aggregation.
Introduced a spatial attention aggregation module (SAAM) for aligning features with semantic context.
Proposed a progressive feature fusion and classification module (PFCM) for multi-scale representation integration.
PMA-VQA outperformed competing methods in accuracy and robustness on RS-VQA LR and HR benchmarks.
Further evaluation on HRVQA confirmed the effectiveness of the SAAM and PFCM across diverse remote sensing scenes.

Abstract

Remote sensing visual question answering (RS-VQA) is essential to intelligent Earth observation, as it supports interactive querying of high-resolution aerial images. Many existing methods struggle with fine-grained geospatial reasoning in remote sensing (RS) scenes because these scenes exhibit intrinsic multi-scale object variance and pronounced spatial heterogeneity. Such models tend to rely more on linguistic priors than on reasoning from visual evidence. In this paper, we present PMA-VQA, a progressive multi-scale feature fusion framework with spatially adaptive attention that embeds the RS-VQA task in spatially based hierarchical feature integration. To achieve hierarchical, multi-level, language-informed integration, we propose a spatial attention aggregation module (SAAM) and a progressive feature fusion and classification module (PFCM). The SAAM employs spatially adaptive gating to align cross-modal features with semantic context, while the PFCM integrates multi-scale representations spanning high-level semantic abstractions and low-level spatial detail. Experimental results on the RS-VQA LR and HR benchmarks show that PMA-VQA outperforms all competing methods in accuracy and robustness. Evaluation on HRVQA further confirms the effectiveness of the SAAM and PFCM across diverse RS scenes.
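
The abstract describes two ideas: question-conditioned spatial gating and progressive fusion of multi-scale features. The sketch below is one possible reading of those ideas in NumPy, not the authors' SAAM or PFCM. The function names (`spatial_gate`, `upsample2x`, `progressive_fuse`), the dot-product gate, the additive coarse-to-fine fusion, and all tensor shapes are assumptions made for illustration only.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def spatial_gate(visual, question):
    """Hypothetical spatially adaptive gate (an assumption, not the paper's SAAM).

    visual:   (H, W, C) feature map
    question: (C,) question embedding
    Returns the gated map (H, W, C) and the gate (H, W).
    """
    # Scaled dot product between each location's feature and the question.
    scores = visual @ question / np.sqrt(visual.shape[-1])  # (H, W)
    gate = sigmoid(scores)                                   # values in (0, 1)
    return visual * gate[..., None], gate


def upsample2x(x):
    """Nearest-neighbour upsampling by 2 along both spatial axes."""
    return x.repeat(2, axis=0).repeat(2, axis=1)


def progressive_fuse(pyramid, question):
    """Hypothetical coarse-to-fine fusion (an assumption, not the paper's PFCM).

    pyramid: list of (H_i, W_i, C) maps ordered fine to coarse, where each
             level has half the spatial resolution of the previous one.
    Returns a pooled (C,) vector that a classifier head could consume.
    """
    fused = None
    for level in reversed(pyramid):  # start at the coarsest level
        gated, _ = spatial_gate(level, question)
        fused = gated if fused is None else gated + upsample2x(fused)
    return fused.mean(axis=(0, 1))   # global average pooling


# Toy inputs: three scales with 8 channels each.
rng = np.random.default_rng(0)
C = 8
pyramid = [rng.standard_normal((s, s, C)) for s in (8, 4, 2)]
question = rng.standard_normal(C)
pooled = progressive_fuse(pyramid, question)
```

In this toy version the gate suppresses locations whose features disagree with the question embedding, and the coarse, semantically richer levels are added into finer levels step by step. A real model would use learned projections and a trained answer classifier in place of these fixed operations.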
