What question did this study set out to answer?

This research aims to improve image retrieval efficiency using a combination of text and image embeddings.

April 4, 2026 | Open Access

A Cross-Modal Approach to Enhancing Image Retrieval With Contrastive Language-Image Pretraining (CLIP)-Based Embeddings and Facebook AI Similarity Search (FAISS) Indexing

Key Points

Aimed to improve image retrieval efficiency by combining text and image embeddings.
Developed a framework using CLIP for generating shared embeddings of images and text.
Indexed embeddings with FAISS for efficient vector similarity search.
Evaluated retrieval performance using mean average precision and Recall@k metrics.
Analyzed similarity score distributions and used t-distributed stochastic neighbor embedding (t-SNE) for visualization.
Achieved strong retrieval performance across text-to-image and image-to-image tasks.
Demonstrated effective organization of images based on conceptual similarity in the embedding space.
Results indicate improved search capabilities for large multimedia datasets.
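
The two evaluation metrics named above can be illustrated with a minimal sketch. It assumes each query has a known set of relevant image IDs and that Recall@k means the fraction of those relevant items found in the top k results; the study may define Recall@k differently (for example, as a per-query hit rate). The function names and the toy IDs are hypothetical and do not come from the paper.

```python
from typing import Hashable, Sequence, Set


def recall_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Fraction of the relevant items that appear among the first k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def average_precision(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Average of precision values taken at each rank where a relevant item occurs."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)


def mean_average_precision(
    rankings: Sequence[Sequence[Hashable]],
    relevants: Sequence[Set[Hashable]],
) -> float:
    """Mean of the per-query average precision over all queries."""
    scores = [average_precision(r, rel) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores) if scores else 0.0


# Toy example (hypothetical IDs): two queries with their ranked results.
toy_rankings = [["a", "b", "c", "d"], ["x", "y"]]
toy_relevants = [{"a", "c"}, {"y"}]
toy_map = mean_average_precision(toy_rankings, toy_relevants)
```

In this toy case the first query finds one of its two relevant items at rank 1 and the other at rank 3, so its average precision is (1/1 + 2/3) / 2.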

Abstract

Cross-modal image retrieval plays an important role in managing large multimedia collections and supporting efficient search across visual and textual data. This study introduces an image retrieval framework based on Contrastive Language-Image Pretraining (CLIP) and Facebook AI Similarity Search (FAISS). The system combines multimodal embedding generation with high-performance vector similarity indexing to support efficient cross-modal search. The framework uses the CLIP model to generate shared embeddings for both images and text queries. These embeddings place visual and linguistic information within the same semantic space, which allows the system to connect text descriptions with related images. To support fast and scalable search, the generated embeddings are indexed using the FAISS library. FAISS performs efficient k-nearest neighbor retrieval in high-dimensional vector spaces, which enables rapid similarity search across large datasets. The system supports both text-to-image and image-to-image retrieval tasks: users can search an image database with either descriptive text queries or reference images. Experimental evaluation shows strong retrieval performance as measured by mean average precision and Recall@k. Additional analyses support these findings. Similarity score distributions and t-distributed stochastic neighbor embedding (t-SNE) projections show clear grouping of images by conceptual similarity within the embedding space. These results indicate that the CLIP representation organizes images based on meaning rather than simple visual patterns. Overall, the results show the value of combining multimodal representation learning with scalable vector indexing. The proposed CLIP-FAISS framework offers a practical solution for image retrieval and supports applications such as visual search engines, digital libraries, and multimedia content management systems.
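
The retrieval step described in the abstract can be sketched in plain Python, as a stand-in for the real components rather than the authors' implementation. The sketch assumes embeddings are compared by cosine similarity after L2 normalization, which is a common choice for CLIP vectors; the abstract does not state the similarity measure or the FAISS index type. The brute-force scan below computes the same kind of exact inner-product ranking that a flat FAISS index would accelerate. The function names and the tiny 3-dimensional vectors are hypothetical; real CLIP embeddings have hundreds of dimensions.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(
    query: Sequence[float],
    database: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (index, cosine similarity) for the k database vectors closest to the query."""
    q = l2_normalize(query)
    scored = []
    for idx, vec in enumerate(database):
        v = l2_normalize(vec)
        score = sum(a * b for a, b in zip(q, v))  # inner product of unit vectors
        scored.append((idx, score))
    # Highest similarity first; ties broken by lower index for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# Hypothetical image embeddings. The query could come from either a text prompt
# or a reference image, because both are assumed to share one embedding space.
image_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_embedding = [1.0, 0.05, 0.0]
results = top_k(query_embedding, image_embeddings, k=2)
```

Because text and image queries are assumed to live in the same space, the same `top_k` call serves both text-to-image and image-to-image search; only the encoder that produces `query_embedding` would differ.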
