What question did this study set out to answer?

The aim is to enhance the retrieval of remote sensing images and texts by improving cross-modal alignment and representation discrimination.

February 24, 2026 | Open Access

Towards Discriminative and Consistent Cross-Modal Alignment for Remote Sensing Image–Text Retrieval

Key Points

Introduced a novel framework called DCCA.
Developed a global contrastive learning strategy with negative pair expansion.
Implemented a bidirectional distribution matching constraint that jointly aligns intra- and inter-modal distributions.
Proposed a remote sensing information injection module to improve visual discriminability.
DCCA outperformed baseline methods on RSITR benchmarks.
Achieved performance comparable to models trained on large-scale remote sensing datasets while requiring markedly less data.
Demonstrated effective domain adaptation in remote sensing contexts.

Abstract

As large-scale remote sensing data continue to proliferate, remote sensing image–text retrieval (RSITR) has attracted growing research attention. Nevertheless, RSITR still faces two primary challenges. First, remote sensing data exhibit substantially higher intra-modal similarity than typical natural image–text corpora, which makes positive and negative pairs harder to distinguish. Second, vision–language pretraining (VLP) models built on natural images, such as CLIP, do not adapt readily to remote sensing scenarios without costly large-scale remote sensing pretraining. To tackle these challenges, we introduce DCCA, a novel framework designed for discriminative and consistent cross-modal alignment. We develop a global contrastive learning strategy with a negative pair expansion mechanism to boost representation discrimination when intra-modal similarity is pronounced. Additionally, we introduce a bidirectional distribution matching constraint that jointly aligns intra- and inter-modal distributions, promoting consistent cross-modal alignment beyond the instance level. To further enhance domain adaptation, we propose a remote sensing information injection module that transfers knowledge from a pretrained remote sensing image recognition model into the VLP model, thereby improving its visual discriminability in remote sensing scenarios. Evaluations on publicly available RSITR benchmarks indicate that DCCA consistently surpasses baseline methods while performing on par with models trained on large-scale remote sensing datasets, under markedly reduced data requirements. These findings indicate that the framework is effective and well suited to practical deployment.
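
To make the two loss ideas in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not give formulas, so everything here is an assumption: the function names, the temperature value, the use of a pool of extra embeddings as "expanded" negatives, and the choice of a symmetric KL divergence between intra-modal similarity distributions as one possible reading of "bidirectional distribution matching".

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def contrastive_loss(img, txt, extra_img=None, extra_txt=None, temperature=0.07):
    """Symmetric InfoNCE loss over matched image/text rows.

    extra_img / extra_txt are hypothetical pools of additional embeddings
    (for example, gathered from other batches) that serve only as negatives.
    """
    img, txt = l2_normalize(img), l2_normalize(txt)
    n = img.shape[0]
    txt_pool = txt if extra_txt is None else np.concatenate([txt, l2_normalize(extra_txt)])
    img_pool = img if extra_img is None else np.concatenate([img, l2_normalize(extra_img)])

    # Row i's positive sits in column i; all other columns are negatives.
    logits_i2t = img @ txt_pool.T / temperature
    logits_t2i = txt @ img_pool.T / temperature
    idx = np.arange(n)
    loss_i2t = -log_softmax(logits_i2t)[idx, idx].mean()
    loss_t2i = -log_softmax(logits_t2i)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)


def bidirectional_kl(img, txt, temperature=0.07):
    """KL(P || Q) + KL(Q || P), where P and Q are the row-wise softmax
    distributions of image-image and text-text similarities.

    This is an assumed stand-in for a distribution matching constraint.
    """
    img, txt = l2_normalize(img), l2_normalize(txt)
    log_p = log_softmax(img @ img.T / temperature)
    log_q = log_softmax(txt @ txt.T / temperature)
    p, q = np.exp(log_p), np.exp(log_q)
    kl_pq = (p * (log_p - log_q)).sum(axis=-1).mean()
    kl_qp = (q * (log_q - log_p)).sum(axis=-1).mean()
    return kl_pq + kl_qp
```

In this sketch, enlarging the negative pool adds terms to each softmax denominator, so the contrastive loss can only grow for fixed embeddings; the model is pushed to separate the positive pair from more candidates at once, which is the intuition behind expanding negatives when intra-modal similarity is high.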
