What question did this study set out to answer?

The aim is to develop an efficient model for image-text retrieval using pretrained unimodal models.

January 18, 2026 | Open Access

Leveraging Pretrained Unimodal Models for Efficient Image-Text Retrieval

Key Points

Aimed to develop an efficient model for image-text retrieval using pretrained unimodal models.
Introduced the LiteITR model, which combines contrastive learning with self-supervised knowledge distillation.
Utilized pretrained unimodal encoders for effective image-text alignment.
Limited training to 3 million image-text pairs to enhance accessibility for smaller teams.
Achieved reasonable performance on retrieval tasks, though short of state-of-the-art results.
Reduced training cost to approximately $20, making it feasible for smaller research groups.

Abstract

Existing approaches to image-text retrieval often require large-scale models, extensive data, and substantial computational resources, limiting their accessibility for smaller research groups. We introduce LiteITR, an efficient self-supervised vision-language model that leverages pretrained unimodal encoders with contrastive learning and self-supervised knowledge distillation. While not reaching state-of-the-art performance, our approach demonstrates reasonable performance on retrieval tasks with dramatically reduced resources, requiring only 3M image-text pairs and costing approximately $20 to train. These findings underscore the potential for designing efficient multimodal retrieval systems that are trainable by researchers with limited resources.
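
The abstract names contrastive learning as one ingredient but does not spell out the objective. As a rough illustration only, the sketch below shows a symmetric InfoNCE-style loss of the kind commonly used to align image and text embeddings. The function names, the temperature of 0.07, and the use of plain Python lists in place of encoder outputs are all assumptions made for this example; none of it is taken from the LiteITR implementation.

```python
import math

def l2_normalize(vec):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss (hypothetical helper, not the paper's code).

    image_embs[i] and text_embs[i] are assumed to be a matching pair;
    every other combination in the batch is treated as a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: temperature-scaled cosine similarity of image i and text j.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: the correct text for image i is text i.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: same computation over the transposed matrix.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (i2t + t2i) / 2

# Toy batch: matching pairs point the same way, so the loss is near zero.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
# Swapping the captions makes every "positive" pair orthogonal, so the loss is large.
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In a setting like the one described, the embeddings would come from the pretrained image and text encoders rather than hand-written lists, and the loss would be minimized with a deep learning framework's automatic differentiation.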


Cite This Study

Cares et al. studied this question.

synapsesocial.com/papers/696c79cde45ebfc9113cd411 https://doi.org/10.25968/opus-3818
