Remote sensing image-text cross-modal retrieval is a key supporting technology for transforming remote sensing big data into knowledge services, and it has significant application value in national strategic areas such as smart city development and ecological environment protection. Remote sensing images contain varied object types and complex textures and spatial structures, which limits what a single feature representation can capture. To address this limitation, this paper proposes a cross-modal retrieval framework based on global–local feature collaborative enhancement. For visual representation, a local feature branch built on graph convolutional networks is added alongside the global features from contrastive language-image pre-training (CLIP), together with a local–global feature synergistic enhancement module that performs self-attention enhancement of multi-scale features, complementary calibration, and hierarchical fusion through an adaptive weight allocation strategy. For text representation, an attention-convolution fusion module is designed to enhance the text semantics and align them with visual information. Finally, cosine similarity is used to compute the semantic correlation between visual and text features, and a contrastive loss is used to optimize the model. Experiments on the RSITMD and RSICD datasets verify the effectiveness of the proposed method.
Zhang et al. (Fri,) studied this question.
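The final matching step named in the abstract, cosine similarity between visual and text features optimized with a contrastive loss, can be illustrated with a minimal NumPy sketch. The abstract does not specify the exact form of the loss, so the symmetric InfoNCE-style formulation, the temperature value of 0.07, and all function names below are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so that dot products equal cosine similarity.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def cosine_similarity_matrix(image_feats, text_feats):
    # Entry (i, j) is the cosine similarity between image i and text j.
    return l2_normalize(image_feats) @ l2_normalize(text_feats).T


def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(image_feats, text_feats, temperature=0.07):
    # Assumed InfoNCE-style loss: row i of image_feats is paired with row i
    # of text_feats; every other row in the batch acts as a negative.
    sim = cosine_similarity_matrix(image_feats, text_feats) / temperature
    idx = np.arange(sim.shape[0])
    loss_i2t = -log_softmax(sim, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(sim, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(4, 8))   # stand-in visual embeddings
    texts = images + 0.1 * rng.normal(size=(4, 8))  # roughly aligned text embeddings
    print("aligned loss:   ", symmetric_contrastive_loss(images, texts))
    print("misaligned loss:", symmetric_contrastive_loss(images, np.roll(texts, 1, axis=0)))
```

In this sketch, correctly paired embeddings should yield a lower loss than the same embeddings with the pairing shifted by one row, which is the behavior a retrieval model is trained toward. At retrieval time, ranking the rows or columns of the similarity matrix gives image-to-text or text-to-image results.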