
September 10, 2025

Contrastive Learning‐Based Fine‐Tuning Method for Cross‐Modal Text‐Image Retrieval

Key Points

Achieves R@5 of 87.1% in text-to-image retrieval and 87.4% in image-to-text retrieval, outperforming baselines by over 13%.
Uses a dual-tower architecture with a ResNet50 visual encoder and a RoBERTa-based text encoder, optimizing the joint embedding space with InfoNCE loss.
Applies a Locked-image Tuning (LiT) strategy: the visual encoder is frozen at first, then both encoders are fine-tuned jointly with mixed-precision training and gradient clipping for stable convergence.
Reduces GPU memory usage by 18%, balancing efficiency and scalability for applications like e-commerce searches.
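
The R@5 figures above are Recall@K scores. As a minimal sketch of how such a score is commonly computed, the function below assumes a square similarity matrix in which the correct candidate for query i is candidate i. That one-match-per-query protocol, the function name recall_at_k, and the toy scores are illustrative assumptions, not details taken from the paper.

```python
def recall_at_k(similarity, k=5):
    """Fraction of queries whose ground-truth candidate ranks in the top k.

    similarity[i][j] is the score of query i against candidate j.
    Assumption: the ground truth for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted by descending similarity to query i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores (hypothetical): query 1 ranks its true match second.
toy_similarity = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.1, 0.2, 0.7],
]
r_at_1 = recall_at_k(toy_similarity, k=1)
r_at_2 = recall_at_k(toy_similarity, k=2)
```

Text-to-image and image-to-text recall would use the same routine, with the similarity matrix transposed for the second direction.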

Abstract

With the rapid proliferation of social media and smart devices, multimodal data has grown explosively, making traditional unimodal retrieval methods insufficient for cross‐modal semantic correlation tasks. To address the text redundancy and image noise found in real‐world scenarios, this paper proposes a contrastive learning‐based, two‐stage progressive fine‐tuning approach for building a high‐precision text‐image cross‐modal retrieval system. We design an efficient data preprocessing pipeline: text data undergoes tokenization, stop‐word filtering, and TF‐IDF‐based keyword extraction, while image data is augmented with Cutout‐style random masking to improve robustness against occlusion and noise. The model employs a dual‐tower architecture composed of a ResNet50 visual encoder and a RoBERTa‐based text encoder, with the joint embedding space optimized using InfoNCE loss. We adopt a Locked‐image Tuning (LiT) strategy in which the visual encoder is initially frozen; both encoders are then fine‐tuned jointly, with mixed‐precision training and gradient clipping to ensure stable convergence. To improve data loading efficiency, we store 50,000 image‐text pairs in LMDB, significantly reducing I/O overhead. Experiments on an industry‐scale dataset demonstrate that the fine‐tuned model achieves R@5 of 87.1% (text‐to‐image) and 87.4% (image‐to‐text), outperforming baselines by over 13% while reducing GPU memory usage by 18%. Our method balances accuracy, efficiency, and scalability, making it suitable for applications such as social media content management and e‐commerce cross‐modal search.
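
To make the InfoNCE objective mentioned in the abstract concrete, here is a minimal pure-Python sketch of one common formulation: a symmetric loss over a batch of matched image and text embeddings. The symmetric averaging, the temperature of 0.07, and all function names are assumptions for illustration; the abstract does not state which variant or hyperparameters the authors use.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax of one row of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]


def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; pair i is (image_embs[i], text_embs[i]).

    Assumption: every other item in the batch serves as a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine(image i, text j) / temperature
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: each row should peak on its diagonal entry.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: same computation over the columns.
    columns = [list(col) for col in zip(*logits)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)


# Hypothetical 2-D embeddings: matched pairs vs. deliberately swapped pairs.
aligned_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each image is most similar to its own caption and grows large when pairs are mismatched, which is the signal that pulls matching pairs together in the joint embedding space.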
