Describing remote sensing imagery in natural language is challenging because ground objects exhibit complex spatial distributions, wide semantic diversity, and large variations in scale. Existing attention-based models typically apply spatial attention over a uniform feature grid, which limits the expressiveness and adaptivity of semantic reasoning. Such fixed-grid mechanisms fail to capture the subtle structure of small or irregular instances and neglect the semantic correlations across different levels of the visual hierarchy. To overcome these limitations, we introduce the Hierarchical Instance-Driven Captioning Network (HIDCap), a framework that adaptively learns to represent, align, and interpret multi-scale visual semantics for remote sensing image captioning. Our approach integrates two key components: (1) an instance-centric multi-hierarchy feature encoder that jointly models object-level, region-level, and global-level representations to preserve fine-grained spatial cues and contextual dependencies; and (2) a cross-level contextual attention mechanism that dynamically selects the relevant semantic hierarchies at each decoding step, enabling the model to attend adaptively to salient instances and contextual backgrounds. This multi-hierarchy reasoning allows HIDCap to describe both dense urban areas and homogeneous landscapes by balancing instance-specific and holistic semantic cues. Comprehensive experiments on benchmark remote sensing datasets demonstrate that the proposed framework significantly surpasses previous attention-based methods in both quantitative metrics and qualitative interpretability. The proposed hierarchical instance reasoning paradigm opens new perspectives for bridging multi-scale visual understanding and language generation in remote sensing image captioning.
Dupuis et al. (Tue,) studied this question.
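To make the idea of cross-level contextual attention concrete, the following is a minimal NumPy sketch of one decoding step, not the HIDCap implementation. It assumes that object-level, region-level, and global-level features have already been extracted and share a common dimension, and it uses plain scaled dot-product attention with no learned parameters. The function names (`attend`, `cross_level_context`), the feature sizes, and the random inputs are all hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def attend(query, feats):
    # Scaled dot-product attention of one query (d,) over features (n, d).
    scores = feats @ query / np.sqrt(query.shape[0])
    weights = softmax(scores)
    return weights @ feats, weights

def cross_level_context(hidden, levels):
    # Step 1: attend within each level to get one context vector per level.
    names = list(levels)
    contexts = np.stack([attend(hidden, levels[n])[0] for n in names])
    # Step 2: attend across the per-level contexts to weight the levels.
    fused, level_weights = attend(hidden, contexts)
    return fused, dict(zip(names, level_weights))

# Hypothetical inputs: 5 object features, 3 region features, 1 global feature.
rng = np.random.default_rng(0)
d = 8
levels = {
    "object": rng.normal(size=(5, d)),
    "region": rng.normal(size=(3, d)),
    "global": rng.normal(size=(1, d)),
}
hidden = rng.normal(size=d)  # stand-in for the decoder state at one step

context, level_weights = cross_level_context(hidden, levels)
print(context.shape, level_weights)
```

In this sketch the level weights sum to one, so at each step the decoder state decides how much the fused context draws from instances, regions, or the whole scene. A trained model would presumably replace the raw dot products with learned projections; that detail is an assumption here, since the abstract does not specify it.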