What question did this study set out to answer?

The research focuses on developing a framework that improves the generation of natural language descriptions for remote sensing images by aligning spatial and semantic features.

February 8, 2026 · Open Access

CSSA: A Cross-Modal Spatial–Semantic Alignment Framework for Remote Sensing Image Captioning

Key Points

Developed a cross-modal spatial–semantic alignment framework (CSSA).
Introduced a multi-branch cross-modal contrastive learning (MCCL) mechanism to align image and text representations.
Designed a dynamic geometry Transformer (DG-former) to incorporate geometry information and improve spatial alignment.
Conducted experiments on three datasets: Sydney-Captions, UCM-Captions, and RSICD.
Demonstrated improved performance in remote sensing image captioning compared to existing methods.
Achieved better semantic alignment between images and captions.
Showed effectiveness in handling the spatial sparsity and geometric relationships in remote sensing images.
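
The key points say MCCL aligns image and text representations through contrastive learning, but this page does not give its loss or branch structure. As a rough illustration of the general idea only, the sketch below implements a generic symmetric InfoNCE-style loss over paired image and caption embeddings. The function names, the `temperature` default, and the single-branch setup are assumptions made for this example, not the paper's MCCL formulation.

```python
import math

def l2_normalize(vec):
    # Scale a non-zero vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def log_softmax(row):
    # Numerically stable log-softmax over one list of scores.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Generic image-text contrastive loss (hypothetical helper, not MCCL itself).

    image_embs[i] and text_embs[i] are assumed to describe the same sample;
    every other pairing in the batch is treated as a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j]: temperature-scaled cosine similarity of image i and caption j.
    sim = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Image-to-text direction: each image should pick out its own caption.
    i2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-image direction: each caption should pick out its own image.
    t2i = -sum(log_softmax([sim[i][j] for i in range(n)])[j]
               for j in range(n)) / n
    return 0.5 * (i2t + t2i)

# Toy batch: two images with matching and with swapped captions.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = symmetric_contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is close to zero when each caption embedding matches its image and large when the captions are swapped, which is the behavior a contrastive alignment objective relies on.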

Abstract

Remote sensing image captioning (RSIC) aims to generate natural language descriptions for a given remote sensing image, which requires a comprehensive, in-depth understanding of the image content and the ability to summarize it in sentences. Most RSIC methods extract visual features successfully, but their spatial or fused feature representations do not fully account for cross-modal differences between remote sensing images and text, resulting in unsatisfactory performance. We therefore propose a novel cross-modal spatial–semantic alignment (CSSA) framework for the RSIC task, which consists of a multi-branch cross-modal contrastive learning (MCCL) mechanism and a dynamic geometry Transformer (DG-former) module. Specifically, compared with discrete text, remote sensing images are noisy, which interferes with the extraction of valid visual features. We therefore present an MCCL mechanism that learns consistent representations between image and text, achieving cross-modal semantic alignment. In addition, most objects in remote sensing images are scattered and sparse because of the overhead view. However, the Transformer structure mines the relationships among objects without considering their geometry information, leading to suboptimal capture of the spatial structure. To address this, the DG-former is designed to achieve spatial alignment by introducing geometry information. We conduct experiments on three publicly available datasets (Sydney-Captions, UCM-Captions, and RSICD), and the superior results demonstrate the effectiveness of the framework.
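
The abstract states that the DG-former adds geometry information to Transformer attention but does not specify how. One common way to do this, shown here purely as an assumed illustration, is to add a pairwise geometry bias to the attention scores before the softmax. In the sketch below, the `(cx, cy, w, h)` box format, the negative center-distance bias, and the `geo_weight` parameter are all hypothetical choices for the example and should not be read as the paper's design.

```python
import math

def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def geometry_bias(box_a, box_b):
    # Assumed bias: regions whose centers are closer get a higher score.
    # Boxes are (cx, cy, w, h) in normalized image coordinates.
    return -math.hypot(box_a[0] - box_b[0], box_a[1] - box_b[1])

def geometry_aware_attention(queries, keys, values, boxes, geo_weight=1.0):
    """Scaled dot-product attention with an additive geometry bias.

    Returns (outputs, weights). This is a generic sketch, not the DG-former.
    """
    d = len(queries[0])
    outputs, weights = [], []
    for i, q in enumerate(queries):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            + geo_weight * geometry_bias(boxes[i], boxes[j])
            for j, k in enumerate(keys)
        ]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of value vectors for region i.
        outputs.append([sum(w[j] * values[j][c] for j in range(len(values)))
                        for c in range(len(values[0]))])
    return outputs, weights

# Three regions with identical features; only their positions differ.
feats = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
boxes = [(0.10, 0.10, 0.1, 0.1), (0.15, 0.10, 0.1, 0.1), (0.90, 0.90, 0.1, 0.1)]
_, biased = geometry_aware_attention(feats, feats, feats, boxes, geo_weight=1.0)
_, plain = geometry_aware_attention(feats, feats, feats, boxes, geo_weight=0.0)
```

With identical region features, the unbiased weights in `plain` are uniform, while `biased` shifts attention for the first region toward its near neighbor and away from the distant one. This is the kind of spatial cue that content-only attention cannot express.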
