
December 11, 2025 · Open Access

A Review of Cross-Modal Image–Text Retrieval in Remote Sensing

Key Points

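Reviewed advances in cross-modal image–text retrieval in remote sensing (RS) and identified its core challenges.
Examined the two mainstream representation paradigms: real-valued embedding and deep hashing.
Analyzed how the evolution of RS datasets influences model capabilities such as multi-scale robustness, small object discriminability, and temporal semantic understanding.
Reviewed representative architectures and technical solutions for RS scenarios, and discussed their limitations in generalization, evaluation consistency, and reproducibility.
Identified three core challenges specific to RS: multi-scale semantic modeling, small object feature preservation, and multi-temporal reasoning.
Highlighted the growing role of vision-language pre-training (VLP) models and the dependence of their performance on large-scale, high-quality image–text corpora.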

Abstract

With the emergence of large-scale vision-language pre-training (VLP) models, remote sensing (RS) image–text retrieval is shifting from global representation learning to fine-grained semantic alignment. This review systematically examines two mainstream representation paradigms—real-valued embedding and deep hashing—and analyzes how the evolution of RS datasets influences model capability, including multi-scale robustness, small object discriminability, and temporal semantic understanding. We further dissect three core challenges specific to RS scenarios: multi-scale semantic modeling, small object feature preservation, and multi-temporal reasoning. Representative architectures and technical solutions are reviewed in depth, followed by a critical discussion of their limitations in terms of generalization, evaluation consistency, and reproducibility. We also highlight the growing role of VLP-based models and the dependence of their performance on large-scale, high-quality image–text corpora. Finally, we outline future research directions, including RS-oriented VLP adaptation and unified multi-granularity evaluation frameworks. These insights aim to provide a coherent reference for advancing practical deployment and promoting cross-domain applications of RS image–text retrieval.
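
The two representation paradigms named in the abstract can be contrasted with a small sketch. This is an illustration only, not a method from the review: the embeddings are made-up toy vectors, the function names (`rank_real`, `rank_hash`, `to_hash`) are hypothetical, and the sign-threshold binarization is a simple stand-in for the learned hash functions that deep hashing methods actually train. Real-valued retrieval ranks images by cosine similarity to the text query; hashing-based retrieval ranks them by Hamming distance between binary codes.

```python
import math

def cosine(u, v):
    """Cosine similarity between two real-valued embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def to_hash(v):
    """Binarize an embedding by sign (assumed stand-in for a learned hash layer)."""
    return tuple(1 if x > 0 else 0 for x in v)

def hamming(h1, h2):
    """Number of bit positions where two binary codes differ."""
    return sum(a != b for a, b in zip(h1, h2))

def rank_real(query, gallery):
    """Rank gallery keys by descending cosine similarity to the query."""
    return sorted(gallery, key=lambda k: cosine(query, gallery[k]), reverse=True)

def rank_hash(query, gallery):
    """Rank gallery keys by ascending Hamming distance to the query code."""
    q = to_hash(query)
    return sorted(gallery, key=lambda k: hamming(q, to_hash(gallery[k])))

# Toy data (hypothetical): one text-query embedding and three image embeddings
# assumed to live in a shared 4-dimensional space.
text_query = [0.9, -0.2, 0.4, -0.7]
images = {
    "harbor": [0.8, -0.1, 0.5, -0.6],
    "farmland": [-0.7, 0.6, -0.3, 0.2],
    "airport": [0.3, 0.4, -0.5, -0.1],
}

print(rank_real(text_query, images))
print(rank_hash(text_query, images))
```

The trade-off the sketch hints at is the usual one: real-valued embeddings keep fine-grained similarity information, while binary codes give up some of that precision in exchange for compact storage and cheap bitwise comparison.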
