Existing approaches to image-text retrieval often require large-scale models, extensive data, and substantial computational resources, which limits their accessibility for smaller research groups. We introduce LiteITR, an efficient self-supervised vision-language model that combines pretrained unimodal encoders with contrastive learning and self-supervised knowledge distillation. Although it does not reach state-of-the-art results, our approach achieves reasonable retrieval performance with dramatically reduced resources, requiring only 3M image-text pairs and costing approximately $20 to train. These findings underscore the potential for efficient multimodal retrieval systems that researchers with limited resources can train.
Cares et al. (Thu,) studied this question.