As large-scale remote sensing data continue to proliferate, research on remote sensing image–text retrieval (RSITR) has become progressively more prominent. Nevertheless, RSITR still faces two primary challenges. First, remote sensing data exhibit substantially higher intra-modal similarity than typical natural image–text corpora, which makes positive and negative pairs harder to tell apart. Second, vision–language pretraining (VLP) models trained on natural images, such as CLIP, do not adapt readily to remote sensing scenarios without costly large-scale remote sensing pretraining. To tackle these challenges, we introduce DCCA, a novel framework designed for discriminative and consistent cross-modal alignment. We develop a global contrastive learning strategy with a negative pair expansion mechanism to boost representation discrimination when intra-modal similarity is pronounced. Additionally, we introduce a bidirectional distribution matching constraint that jointly aligns intra- and inter-modal distributions, promoting consistent cross-modal alignment beyond the instance level. To further enhance domain adaptation, we propose a remote sensing information injection module that transfers knowledge from a pretrained remote sensing image recognition model into the VLP model, thereby improving its visual discriminability in remote sensing scenarios. Evaluations on publicly available RSITR benchmarks indicate that DCCA consistently surpasses baseline methods and performs on par with models trained on large-scale remote sensing datasets while requiring markedly less data. These findings verify that the framework is both effective and well-suited for practical deployment.
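To make the idea of contrastive learning with an enlarged negative set concrete, the following is a minimal sketch, not the DCCA implementation. It assumes a standard symmetric InfoNCE loss over cosine similarities, with the negative set extended by additional embeddings passed in by the caller. The function names (`info_nce`, `symmetric_loss`), the `extra_*` parameters, and the temperature of 0.07 are illustrative assumptions; the abstract does not specify how DCCA selects or expands its negatives.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(queries, keys, extra_negatives=(), temperature=0.07):
    """Mean InfoNCE loss where queries[i] is paired with keys[i].

    All other keys, plus every vector in extra_negatives (a hypothetical
    stand-in for an expanded negative pool), act as negatives.
    """
    queries = [normalize(q) for q in queries]
    candidates = [normalize(k) for k in list(keys) + list(extra_negatives)]
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


def symmetric_loss(image_emb, text_emb, extra_text=(), extra_image=(),
                   temperature=0.07):
    # Average the image-to-text and text-to-image directions.
    i2t = info_nce(image_emb, text_emb, extra_text, temperature)
    t2i = info_nce(text_emb, image_emb, extra_image, temperature)
    return 0.5 * (i2t + t2i)
```

With the matched pairs held fixed, adding negatives can only enlarge the softmax denominator, so the loss grows and the model is pushed to separate each positive pair from a wider pool of similar-looking samples. That is the intuition behind expanding negatives when intra-modal similarity is high.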
Song et al. (Sun,) studied this question.