The text-only training paradigm reduces reliance on paired image-text data, but it struggles to capture the spatial structures and hierarchical semantics inherent to geospatial scenes, primarily because of the modality gap between sequential text and complex imagery. To address this limitation, we propose GeoStructCap, a novel text-only training framework for zero-shot remote sensing image captioning. Specifically, to learn visual-specific spatial structures and hierarchical semantics solely from textual supervision, we propose the structured hierarchical visual perception simulation (SHVPS) mechanism, which mimics human visual perception by decomposing textual corpora into a structured knowledge bank and employing a symmetric query strategy. Furthermore, to generate contextually accurate and semantically diverse descriptions, we design a refined knowledge-guidance (RKG) module, which retrieves semantic priors from the structured knowledge bank and uses them to guide the caption decoder via noise-augmented retrieval and adaptive integration. Extensive experiments on multiple benchmark datasets demonstrate that GeoStructCap consistently outperforms state-of-the-art zero-shot methods and narrows the performance gap with supervised approaches. Notably, in real-world evaluations across representative regions of Shanghai, GeoStructCap achieves a 6.0% improvement in CLIP Score over the leading alternative, demonstrating its practical applicability to remote sensing image captioning.
Cheng et al. (Thu,) studied this question.
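To make the idea of noise-augmented retrieval with adaptive integration more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify how noise is injected, which similarity measure is used, or how retrieved priors are combined. This sketch assumes Gaussian noise added to the query embedding, cosine similarity against the knowledge bank, and softmax weighting of the top-k entries. The names `noisy_retrieve`, `noise_std`, and `temperature` are hypothetical.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def noisy_retrieve(query, bank, k=2, noise_std=0.1, temperature=0.1, rng=None):
    """Retrieve the top-k bank entries for a noise-perturbed query.

    Assumptions (not taken from the paper): Gaussian noise on the query,
    cosine similarity for ranking, softmax weights for integration.
    """
    rng = rng or random.Random()

    # Perturb the query embedding, then renormalize it.
    noisy = normalize([x + rng.gauss(0.0, noise_std) for x in query])

    # Cosine similarity between the noisy query and every bank entry.
    sims = [dot(noisy, normalize(entry)) for entry in bank]

    # Indices of the k most similar entries, best first.
    top = sorted(range(len(bank)), key=lambda i: sims[i], reverse=True)[:k]

    # Softmax over the retrieved similarities (numerically stabilized).
    logits = [sims[i] / temperature for i in top]
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted average of the retrieved entries serves as the semantic prior.
    prior = [
        sum(w * bank[i][d] for w, i in zip(weights, top))
        for d in range(len(query))
    ]
    return top, weights, prior
```

Calling `noisy_retrieve(query, bank, noise_std=0.0)` reduces to plain top-k retrieval. Raising `noise_std` makes the retrieved set vary between calls, which is one plausible way to trade accuracy for diversity in the guidance passed to a decoder.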