With the emergence of large-scale vision-language pre-training (VLP) models, remote sensing (RS) image–text retrieval is shifting from global representation learning to fine-grained semantic alignment. This review systematically examines two mainstream representation paradigms—real-valued embedding and deep hashing—and analyzes how the evolution of RS datasets influences model capability, including multi-scale robustness, small object discriminability, and temporal semantic understanding. We further dissect three core challenges specific to RS scenarios: multi-scale semantic modeling, small object feature preservation, and multi-temporal reasoning. Representative architectures and technical solutions are reviewed in depth, followed by a critical discussion of their limitations in terms of generalization, evaluation consistency, and reproducibility. We also highlight the growing role of VLP-based models and the dependence of their performance on large-scale, high-quality image–text corpora. Finally, we outline future research directions, including RS-oriented VLP adaptation and unified multi-granularity evaluation frameworks. These insights aim to provide a coherent reference for advancing practical deployment and promoting cross-domain applications of RS image–text retrieval.
Xu et al. studied this question.