ABSTRACT With the rapid proliferation of social media and smart devices, multimodal data has grown explosively, making traditional unimodal retrieval methods insufficient for cross‐modal semantic correlation tasks. To tackle the challenges caused by text redundancy and image noise in real‐world scenarios, this paper proposes a contrastive learning‐based, two‐stage progressive fine‐tuning approach for building a high‐precision text‐image cross‐modal retrieval system. We design an efficient data preprocessing pipeline: text data undergoes tokenization, stop‐word filtering, and TF‐IDF‐based keyword extraction, while image data is augmented with Cutout‐style random masking to improve robustness against occlusion and noise. The model employs a dual‐tower architecture composed of a ResNet50 visual encoder and a RoBERTa‐based text encoder, with the joint embedding space optimized using the InfoNCE loss. A Locked‐image Tuning (LiT) strategy is introduced, in which the visual encoder is initially frozen and both encoders are then fine‐tuned jointly, with mixed‐precision training and gradient clipping to stabilize convergence. To improve data loading efficiency, we use LMDB to store 50,000 image‐text pairs, significantly reducing I/O overhead. Experiments on an industry‐scale dataset demonstrate that the fine‐tuned model achieves R@5 of 87.1% (text‐to‐image) and 87.4% (image‐to‐text), outperforming baselines by over 13% while reducing GPU memory usage by 18%. Our method balances accuracy, efficiency, and scalability, making it suitable for applications such as social media content management and e‐commerce cross‐modal search.
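The InfoNCE objective named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the symmetric (image-to-text plus text-to-image) form, the temperature of 0.07, and the function names `l2_normalize` and `info_nce` are assumptions made for illustration, and the inputs stand in for encoder outputs as plain lists of non-zero vectors.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_embs[i] and text_embs[i] are assumed to form the positive pair;
    every other combination in the batch acts as a negative.
    The temperature value is an assumed default, not taken from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # logits[i][j]: cosine similarity of image i and text j, divided by temperature.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_on_diagonal(rows):
        # Mean of -log softmax(row)[i], computed with a stable log-sum-exp.
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)
            log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    image_to_text = cross_entropy_on_diagonal(logits)
    # Transposing the logits gives the text-to-image direction.
    text_to_image = cross_entropy_on_diagonal([list(col) for col in zip(*logits)])
    return 0.5 * (image_to_text + text_to_image)
```

With correctly paired embeddings the loss approaches zero, while mismatched pairs are penalized heavily; minimizing it pulls each matched pair together and pushes in-batch negatives apart in the joint embedding space.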
Zhao et al. (Mon,) studied this question.