Remote sensing image captioning (RSIC) aims to generate natural language descriptions of a given remote sensing image, which requires a comprehensive, in-depth understanding of the image content and the ability to summarize it in sentences. Most RSIC methods extract vision features successfully, but their spatial or fused feature representations do not fully account for the cross-modal differences between remote sensing images and text, resulting in unsatisfactory performance. We therefore propose a novel cross-modal spatial–semantic alignment (CSSA) framework for RSIC, which consists of a multi-branch cross-modal contrastive learning (MCCL) mechanism and a dynamic geometry Transformer (DG-former) module. Specifically, compared with discrete text, remote sensing images are noisy, which interferes with the extraction of valid vision features. We therefore present the MCCL mechanism to learn consistent representations of image and text, achieving cross-modal semantic alignment. In addition, because of the overhead view, most objects in remote sensing images are scattered and sparse. The Transformer structure, however, mines relationships among objects without considering their geometry information, leading to suboptimal capture of the spatial structure. To address this, the DG-former is designed to realize spatial alignment by introducing geometry information. We conduct experiments on three publicly available datasets (Sydney-Captions, UCM-Captions, and RSICD), and the superior results demonstrate the effectiveness of the CSSA framework.
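As a rough illustration of the kind of objective that image-text contrastive learning typically builds on, the sketch below computes a generic symmetric InfoNCE-style loss over paired image and text embeddings. It is not the MCCL mechanism itself, whose multi-branch design is not detailed here; the function names, the toy embeddings, and the temperature value of 0.07 are all assumptions made for illustration.

```python
import math

def l2_normalize(vec):
    # Scale a vector to unit length; fall back to 1.0 to avoid dividing by zero.
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cross_entropy_diag(logits):
    # Mean cross-entropy where the correct "class" for row i is column i,
    # i.e. the i-th image is paired with the i-th caption.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(image_feats, text_feats, temperature=0.07):
    # Hypothetical helper: cosine-similarity logits, then the average of the
    # image-to-text and text-to-image cross-entropy terms.
    imgs = [l2_normalize(v) for v in image_feats]
    txts = [l2_normalize(v) for v in text_feats]
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy embeddings (assumed values): matched pairs versus deliberately swapped pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_contrastive_loss(images, [[0.9, 0.1], [0.1, 0.9]])
shuffled = symmetric_contrastive_loss(images, [[0.1, 0.9], [0.9, 0.1]])
```

In this toy setting the loss is lower when each image embedding is closest to its own caption embedding than when the captions are swapped, which is the behavior a cross-modal alignment objective of this family is meant to reward.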
Han et al. (Thu,) studied this question.