Remote sensing image captioning is a foundational multimodal task for fine-grained understanding of remote sensing images. However, because remote sensing images contain complex scenes and many objects, it is very challenging to accurately describe the objects in a scene together with their attributes and dependencies. To address these issues, this article proposes a novel scale-aware prompting with optimal transport (SPOT) method that learns effective multiscale features under diverse scenes and builds fine-grained cross-modal alignment between semantic features and linguistic words during caption generation. Specifically, a scale-aware prompt extractor is constructed to explore feature integration in complex scenes by learning prompts that query multiscale features, and to enhance the representation of object attributes and dependencies by embedding positional relations. In addition, fine-grained cross-modal alignment is designed to dynamically match image feature representations and textual semantics through optimal transport. In this way, the model learns effective language-aligned feature representations for caption generation. Finally, a caption Transformer with causal self-attention is introduced to generate accurate captions for remote sensing scenes. Extensive experiments demonstrate that the proposed method achieves state-of-the-art performance on three public datasets, and ablation studies further demonstrate the contribution of each component.
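To illustrate the idea of matching image features to words through optimal transport, the following is a minimal sketch and not the SPOT implementation. It assumes a cosine-distance cost, uniform marginals over regions and words, and entropy-regularised Sinkhorn iterations; none of these choices are stated in the abstract. The function names `cosine_cost`, `sinkhorn`, and `ot_alignment`, and the parameters `eps` and `n_iters`, are hypothetical.

```python
import math


def cosine_cost(image_feats, word_feats):
    """Cost matrix C[i][j] = 1 - cosine similarity of region i and word j."""
    def unit(vec):
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    img = [unit(v) for v in image_feats]
    txt = [unit(v) for v in word_feats]
    return [[1.0 - sum(a * b for a, b in zip(u, w)) for w in txt] for u in img]


def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropy-regularised transport plan between uniform marginals (assumed)."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # mass on each image region
    b = [1.0 / m] * m  # mass on each word
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def ot_alignment(image_feats, word_feats, eps=0.1):
    """Return the transport plan and the resulting transport cost."""
    cost = cosine_cost(image_feats, word_feats)
    plan = sinkhorn(cost, eps=eps)
    distance = sum(p * c for p_row, c_row in zip(plan, cost)
                   for p, c in zip(p_row, c_row))
    return plan, distance


# Toy example: three region features and two word embeddings in 2-D.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[1.0, 0.0], [0.0, 1.0]]
plan, distance = ot_alignment(regions, words)
```

In this toy example, each region sends most of its mass to the word it is most similar to, while the region that is equally similar to both words splits its mass between them. In a trainable model, a transport cost of this kind could serve as an alignment loss, though how SPOT uses it is not specified here.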
Zhang et al. (Thu,) studied this question.