What question did this study set out to answer?

This research aims to develop a method to improve caption generation for remote sensing images.

May 7, 2026

Scale-aware Prompting with Optimal Transport for Remote Sensing Image Captioning

Key Points

Introduced a scale-aware prompting mechanism using optimal transport.
Constructed a scale-aware prompt extractor for multi-scale feature querying.
Designed a cross-modal alignment strategy for matching image features with semantic representations.
Implemented a caption Transformer with causal self-attention for generating captions.
Achieved state-of-the-art performance on three public datasets.
Demonstrated improved alignment of semantic features with generated captions.
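
As a rough illustration of the "causal self-attention" named in the key points, the sketch below shows single-head attention in which each caption token may attend only to itself and earlier tokens. It is a generic NumPy sketch, not the paper's caption Transformer: the function name, the weight matrices `w_q`, `w_k`, `w_v`, and all sizes are assumptions made for the example.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with a causal (lower-triangular) mask.

    x: (seq_len, d_model) token embeddings; w_*: (d_model, d_head) projections.
    All names and shapes are illustrative assumptions, not taken from the paper.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # True above the diagonal marks "future" positions that must be hidden.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over each row; masked entries become 0.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 8, 4        # hypothetical sizes
tokens = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, attn = causal_self_attention(tokens, w_q, w_k, w_v)
```

Because of the mask, changing a later token leaves the outputs at earlier positions unchanged, which is what allows a caption to be generated one word at a time.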

Abstract

Remote sensing image captioning is a fundamental multimodal task for fine-grained understanding of remote sensing images. However, because remote sensing images contain complex scenes and many objects, accurately describing the objects in a scene together with their attributes and dependencies is very challenging. To address these issues, this article proposes scale-aware prompting with optimal transport (SPOT), which learns effective multi-scale features across diverse scenes and builds fine-grained cross-modal alignment between semantic features and words during caption generation. Specifically, a scale-aware prompt extractor learns prompts that query multi-scale features in order to explore feature integration in complex scenes, and embeds positional relations to enhance the representation of object attributes and dependencies. In addition, a fine-grained cross-modal alignment module uses optimal transport to dynamically match image feature representations with textual semantics. In this way, the model learns effective language-aligned feature representations for caption generation. Finally, a caption Transformer with causal self-attention generates captions for remote sensing scenes. Extensive experiments show that the proposed method achieves state-of-the-art performance on three public datasets, and ablation studies of each component further support its effectiveness.
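
To make the optimal-transport matching step more concrete, here is a minimal sketch of entropy-regularized optimal transport solved with Sinkhorn iterations, applied to a cosine-distance cost between a set of image features and a set of word embeddings. The abstract does not state which solver, cost function, or marginals SPOT uses, so the Sinkhorn solver, the cosine cost, the uniform marginals, the regularization strength `eps`, and all names and sizes below are assumptions chosen for illustration.

```python
import numpy as np

def cosine_cost(x, y):
    """Pairwise cost 1 - cosine similarity between rows of x and rows of y."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - x @ y.T

def sinkhorn_plan(cost, a, b, eps=0.5, n_iters=200):
    """Entropy-regularized transport plan between marginals a and b.

    eps and n_iters are assumed hyperparameters, not values from the paper.
    """
    kernel = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (kernel @ v)              # rescale to match row marginals
        v = b / (kernel.T @ u)            # rescale to match column marginals
    return u[:, None] * kernel * v[None, :]

rng = np.random.default_rng(0)
image_feats = rng.normal(size=(6, 16))    # 6 hypothetical visual features
word_feats = rng.normal(size=(4, 16))     # 4 hypothetical word embeddings

a = np.full(6, 1.0 / 6)                   # uniform mass over visual features
b = np.full(4, 1.0 / 4)                   # uniform mass over words
cost = cosine_cost(image_feats, word_feats)
plan = sinkhorn_plan(cost, a, b)

# Transport cost under the plan; a quantity of this kind could serve as an
# alignment loss, though the paper's exact objective is not given here.
ot_distance = float((plan * cost).sum())
```

Each entry `plan[i, j]` says how much of visual feature `i` is matched to word `j`, so the plan acts as a soft, many-to-many alignment rather than a hard one-to-one assignment.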
