What type of study is this?

This is a quantitative study.

October 11, 2025, Open Access

A CLIP-Based Cross-Modal Matching Model for Image-Text Retrieval

Key Points

The proposed model performs strongly on image-text retrieval tasks, with R@1, R@5, and R@10 scores ranging from 80% to 90%.
Feature Entropy improves over the baseline for both text (4.27 vs. 3.62) and image (4.13 vs. 3.47) features, indicating enhanced semantic expressiveness.
The methodology uses pre-trained ViT and BERT models for feature extraction, which significantly boosts training efficiency and model convergence.
Compared to the DeViSE model, the approach is more accurate, improving the R@1, R@5, and R@10 retrieval metrics by up to 12.9%.

Abstract

In recent years, demand for multimodal data retrieval has grown rapidly. Images and texts, two major modalities for conveying information, differ significantly in feature distribution. To address challenges in image-text retrieval, such as balancing efficiency with performance and enhancing semantic modelling, this paper proposes an efficient cross-modal feature matching model based on the CLIP framework. The model has two parts: feature extraction and contrastive learning. For feature extraction, pre-trained ViT and BERT models capture deep semantic features of images and texts. Compared with the baseline, the extracted features show significant improvements in Feature Entropy (text: 4.27 vs. 3.62; image: 4.13 vs. 3.47) and Mutual Information (28.3% for text, 31.5% for image), indicating stronger semantic expressiveness and alignment. Contrastive learning with a cosine-based loss function and Adam optimization ensures stable convergence. Furthermore, preprocessing innovations such as removing redundant text tokens and Base64 image encoding boost training efficiency. Experiments on a dataset of 50,000 image-text pairs show that the model achieves high and stable retrieval performance, with R@1, R@5, and R@10 scores ranging from 80% to 90%. Compared to the classic DeViSE model, the approach improves the three metrics by 12.9%, 10.0%, and 9.0%, respectively, confirming the model’s superior accuracy and generalization in large-scale retrieval scenarios. Across the image-text retrieval tasks evaluated, the model consistently demonstrates strong cross-modal matching capabilities and accurately captures the semantic associations between images and texts.
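
The abstract names a cosine-based contrastive loss and R@K retrieval metrics without giving formulas. The following is a minimal sketch of how such quantities are commonly computed, not the paper's implementation. It assumes the symmetric cross-entropy formulation used by the original CLIP, a temperature of 0.07, and that row i of the image embeddings is paired with row i of the text embeddings. All function names are hypothetical, and plain Python lists stand in for the ViT and BERT feature vectors.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_matrix(image_embs, text_embs):
    """Return sim[i][j] = cosine similarity between image i and text j."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(sim, temperature=0.07):
    """Symmetric contrastive loss (assumed CLIP-style, not confirmed by the paper)."""
    logits = [[s / temperature for s in row] for row in sim]
    transposed = [list(col) for col in zip(*logits)]
    image_to_text = _diagonal_cross_entropy(logits)
    text_to_image = _diagonal_cross_entropy(transposed)
    return 0.5 * (image_to_text + text_to_image)


def recall_at_k(sim, k):
    """Fraction of queries (rows) whose paired item ranks within the top k."""
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)
```

With toy embeddings where every image points in the same direction as its paired text, `recall_at_k(sim, 1)` returns 1.0 and the loss is close to zero. Swapping the text rows drives R@1 down and the loss up, which is the behaviour a retrieval model is trained to avoid.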
