With the rapid advancement of vision-language models (VLMs) in general-purpose settings, their application to cross-modal retrieval and semantic understanding of large-scale multimodal remote sensing (RS) data is emerging as a key enabler for urban governance, environmental monitoring, and disaster response. However, the pervasive semantic shift in RS images poses a significant challenge to the transferability of pre-trained VLMs. To address this limitation, we propose ReCoTR, an enhanced CLIP-based cross-modal retrieval framework tailored for remote sensing applications. ReCoTR tackles region-level granularity bias and contextual semantic drift through a Dual Consensus Token Evaluation (DCTE) module, which uses a mixture-of-experts strategy to fuse inter-modal semantic consensus with intra-modal structural consistency, enabling fine-grained estimation of the semantic confidence of each visual token. Moreover, to mitigate representational contamination caused by background noise, we introduce the Semantic Confidence Token Compression (SCTC) module. This module selectively filters and aggregates tokens with high semantic relevance, reducing redundancy and alleviating the noise amplification inherent in CLIP's average pooling. Experimental results on three benchmark RS cross-modal retrieval datasets demonstrate that ReCoTR consistently outperforms existing methods on bidirectional image-text retrieval tasks, validating its effectiveness and robustness in remote sensing semantic alignment scenarios. Our source code is available at: https://github.com/Jerry710/ReCoTR.git.
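The token scoring and compression idea described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `token_confidence` and `compress_tokens`, the fixed scalar `gate` (a stand-in for the learned mixture-of-experts fusion), the cosine-similarity scores, and the softmax-weighted aggregation are all assumptions made for illustration only.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def token_confidence(tokens, text_emb, gate=0.5):
    """Hypothetical per-token confidence score.

    tokens:   (N, D) visual token embeddings
    text_emb: (D,)   text embedding
    gate:     fixed fusion weight (assumption; the paper describes a
              mixture-of-experts strategy rather than a constant)
    """
    t = l2_normalize(tokens)
    q = l2_normalize(text_emb)
    # Inter-modal consensus: cosine similarity of each token to the text.
    inter = t @ q
    # Intra-modal consistency: mean cosine similarity to the other tokens.
    sim = t @ t.T
    n = t.shape[0]
    intra = (sim.sum(axis=1) - np.diag(sim)) / max(n - 1, 1)
    return gate * inter + (1.0 - gate) * intra


def compress_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens and pool them with
    softmax weights (assumed aggregation rule)."""
    idx = np.argsort(scores)[::-1][:keep]
    w = np.exp(scores[idx] - scores[idx].max())
    w = w / w.sum()
    return w @ tokens[idx], idx
```

In this sketch, tokens that agree with both the text and their neighbouring tokens dominate the pooled representation, whereas plain average pooling would give every background token equal weight.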
Huang et al. (Thu,) studied this question.