What question did this study set out to answer?

This research aims to improve the transferability of vision-language models for remote sensing image-text retrieval by addressing semantic shifts.

May 14, 2026

ReCoTR: Reducing Semantic Cognitive Shift via Dual-Consensus Token Compression for Remote Sensing Image-Text Retrieval

Key Points

The goal is to make pre-trained vision-language models transfer better to remote sensing image-text retrieval by reducing semantic shift.
The proposed ReCoTR framework uses a Dual Consensus Token Evaluation (DCTE) module to estimate the semantic confidence of visual tokens.
A Semantic Confidence Token Compression (SCTC) module filters and aggregates high-relevance tokens to reduce redundancy and background noise.
The method is evaluated on three benchmark remote sensing cross-modal retrieval datasets.
ReCoTR outperforms existing methods on all three datasets for bidirectional image-text retrieval.
The results support improved semantic alignment, with less redundancy and noise in the token representation.

Abstract

With the rapid advancement of vision-language models (VLMs) in general-purpose settings, their application to cross-modal retrieval and semantic understanding of large-scale multimodal remote sensing (RS) data is emerging as a key enabler for urban governance, environmental monitoring, and disaster response. However, the pervasive issue of semantic shift in RS images poses a significant challenge to the transferability of pre-trained VLMs. To address this limitation, we propose ReCoTR, an enhanced CLIP-based cross-modal retrieval framework tailored for remote sensing applications. ReCoTR tackles region-level granularity bias and contextual semantic drift through a Dual Consensus Token Evaluation (DCTE) module, which leverages a mixture-of-experts strategy to fuse inter-modal semantic consensus with intra-modal structural consistency, enabling fine-grained estimation of semantic confidence for visual tokens. Moreover, to mitigate representational contamination caused by background noise, we introduce the Semantic Confidence Token Compression (SCTC) module. This module selectively filters and aggregates tokens with high semantic relevance, thus reducing redundancy and alleviating the noise amplification inherent in CLIP's average pooling. Experimental results on three benchmark RS cross-modal retrieval datasets demonstrate that ReCoTR consistently outperforms existing methods on bidirectional image-text retrieval tasks, validating its effectiveness and robustness in remote sensing semantic alignment scenarios. Our source code is available at: https://github.com/Jerry710/ReCoTR.git.
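
The idea described in the abstract (score each visual token, keep the confident ones, and aggregate them instead of averaging everything) can be illustrated with a small sketch. This is not the authors' implementation. The function names (`token_confidence`, `compress_tokens`, `mean_pool`), the fixed `gate` value standing in for a learned mixture-of-experts gate, the cosine-based scores, and the top-k plus softmax aggregation are all assumptions chosen for illustration; the actual DCTE and SCTC modules are defined in the paper and its repository.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def token_confidence(tokens, text_vec, gate=0.5):
    """Hypothetical confidence score per visual token.

    inter: similarity of the token to the text embedding (inter-modal cue).
    intra: mean similarity of the token to the other tokens (intra-modal cue).
    gate:  fixed mixing weight, an assumed stand-in for a learned gate.
    """
    scores = []
    for i, tok in enumerate(tokens):
        inter = cosine(tok, text_vec)
        others = [cosine(tok, t) for j, t in enumerate(tokens) if j != i]
        intra = sum(others) / len(others) if others else 0.0
        scores.append(gate * inter + (1.0 - gate) * intra)
    return scores


def compress_tokens(tokens, scores, keep=2):
    """Keep the `keep` highest-scoring tokens and pool them with softmax weights."""
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
    peak = max(scores[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(scores[i] - peak) for i in top]
    total = sum(exps)
    dim = len(tokens[0])
    pooled = [
        sum((e / total) * tokens[i][d] for e, i in zip(exps, top))
        for d in range(dim)
    ]
    return pooled, sorted(top)


def mean_pool(tokens):
    """Plain average pooling over all tokens, used here as the baseline."""
    n = len(tokens)
    return [sum(t[d] for t in tokens) / n for d in range(len(tokens[0]))]


if __name__ == "__main__":
    # Toy 2-D "tokens": the first two match the text, the last two act as background.
    tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-0.1, 1.0]]
    text_vec = [1.0, 0.0]

    scores = token_confidence(tokens, text_vec)
    pooled, kept = compress_tokens(tokens, scores, keep=2)

    print("kept token indices:", kept)
    print("similarity after compression:", round(cosine(pooled, text_vec), 3))
    print("similarity after mean pooling:", round(cosine(mean_pool(tokens), text_vec), 3))
```

In this toy setup the two background tokens pull the averaged vector away from the text embedding, while the compressed vector stays close to it, which is the effect the abstract attributes to filtering before aggregation.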
