September 10, 2025 | Open Access

A global–local dual-stream collaborative enhancement model for cross-modal image and text retrieval in remote sensing scenarios

Key Points

The proposed model improves retrieval performance through global-local collaborative feature enhancement, addressing the limitations of single-feature representations of remote sensing images.
Semantic correlation between visual and text features is computed with cosine similarity, and the model is optimized with a contrastive loss.
The cross-modal retrieval framework adds a graph convolutional network branch for local features alongside the global features and fuses the two through an adaptive weight allocation strategy.
The method targets applications such as smart city development and ecological environment protection, which depend on effective use of remote sensing data.

Abstract

Remote sensing image-text cross-modal retrieval is a key supporting technology for transforming remote sensing big data into knowledge services, and it holds significant application value in national strategic areas such as smart city development and ecological environment protection. To address the limitations of single feature representation caused by the multimodal object types and complex texture spatial structures of remote sensing images, this paper proposes a cross-modal retrieval framework based on global–local feature collaborative enhancement. For visual representation, a local feature branch using graph convolutional networks is added on top of the global features from Contrastive Language-Image Pre-training (CLIP). A local–global feature synergistic enhancement module then applies self-attention enhancement of multi-scale features, complementary calibration, and hierarchical fusion through an adaptive weight allocation strategy. For text representation, an attention-convolution fusion module is designed to enhance the semantics of the text and align it with visual information. Finally, cosine similarity is employed to compute the semantic correlation between visual and text features, and contrastive loss is used for model optimization. Experiments on the RSITMD and RSICD datasets verify the effectiveness of the proposed method.
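The final step of the abstract, scoring image-text pairs by cosine similarity and training with a contrastive loss, can be illustrated with a minimal NumPy sketch. The abstract does not specify the exact loss, so the symmetric InfoNCE-style form below (as popularized by CLIP) is an assumption, as are the function names, the temperature value of 0.07, and the toy feature vectors; none of them come from the paper.

```python
import numpy as np

def cosine_similarity_matrix(img, txt):
    """Pairwise cosine similarity between rows of img (N, D) and txt (M, D)."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    return img @ txt.T

def _log_softmax(x, axis):
    # Subtract the max first for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(img, txt, temperature=0.07):
    """Assumed InfoNCE-style loss: row i of img is the match for row i of txt."""
    logits = cosine_similarity_matrix(img, txt) / temperature
    idx = np.arange(logits.shape[0])
    loss_i2t = -_log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -_log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

# Toy features standing in for the fused visual and text embeddings (hypothetical).
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(4, 8))
image_feats = text_feats + 0.01 * rng.normal(size=(4, 8))  # nearly aligned pairs

sim = cosine_similarity_matrix(image_feats, text_feats)
ranking = np.argsort(-sim, axis=1)  # per image, captions sorted by similarity
loss = symmetric_contrastive_loss(image_feats, text_feats)
```

At retrieval time only the similarity matrix is needed: each row of `ranking` lists candidate captions for one image in descending similarity, and transposing `sim` gives the text-to-image direction.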
