Cross-modal image retrieval plays an important role in managing large multimedia collections and supporting efficient search across visual and textual data. This study introduces an image retrieval framework that combines Contrastive Language-Image Pretraining (CLIP) for multimodal embedding generation with Facebook AI Similarity Search (FAISS) for high-performance vector similarity indexing. The CLIP model encodes both images and text queries into a shared semantic space, which allows the system to match text descriptions with related images. To keep search fast and scalable, the generated embeddings are indexed with the FAISS library, which performs efficient k-nearest neighbor retrieval in high-dimensional vector spaces and enables rapid similarity search across large datasets. The system supports both text-to-image and image-to-image retrieval: users can query an image database with either a descriptive text query or a reference image. Experimental evaluation shows strong retrieval performance as measured by mean average precision (mAP) and Recall@k. Additional analyses support these findings. Similarity score distributions and t-distributed stochastic neighbor embedding (t-SNE) projections show clear grouping of images by conceptual similarity within the embedding space, indicating that the CLIP representation organizes images by meaning rather than by simple visual patterns. Overall, the results show the value of combining multimodal representation learning with scalable vector indexing. The proposed CLIP-FAISS framework offers a practical solution for image retrieval and supports applications such as visual search engines, digital libraries, and multimedia content management systems.
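The retrieval and evaluation steps described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the three-dimensional vectors are hypothetical stand-ins for CLIP embeddings, the brute-force cosine search stands in for a FAISS index (on L2-normalized vectors it should give the same ranking as an exact inner-product index such as `faiss.IndexFlatIP`), and the function names `knn_search`, `recall_at_k`, and `average_precision` are chosen here for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so the dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def knn_search(query, index_vectors, k):
    """Exact k-nearest-neighbor search by cosine similarity (brute force)."""
    q = l2_normalize(query)
    scored = []
    for idx, vec in enumerate(index_vectors):
        v = l2_normalize(vec)
        similarity = sum(a * b for a, b in zip(q, v))
        scored.append((similarity, idx))
    # Highest similarity first; ties go to the lower index for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for idx in ranked[:k] if idx in relevant)
    return hits / len(relevant)


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank holding a relevant item."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, idx in enumerate(ranked, start=1):
        if idx in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Hypothetical toy "embeddings"; real CLIP vectors have hundreds of dimensions.
image_embeddings = [
    [1.0, 0.0, 0.0],  # image 0
    [0.9, 0.1, 0.0],  # image 1
    [0.0, 1.0, 0.0],  # image 2
    [0.0, 0.0, 1.0],  # image 3
]
text_query_embedding = [1.0, 0.05, 0.0]  # assumed output of a text encoder
relevant_images = {0, 1}  # assumed ground-truth matches for this query

ranked = knn_search(text_query_embedding, image_embeddings, k=4)
print("ranking:", ranked)
print("Recall@1:", recall_at_k(ranked, relevant_images, 1))
print("Recall@2:", recall_at_k(ranked, relevant_images, 2))
print("AP:", average_precision(ranked, relevant_images))
```

Mean average precision is then the mean of `average_precision` over all evaluation queries. Image-to-image retrieval follows the same path, with the query vector coming from the image encoder instead of the text encoder. At scale, the brute-force loop would be replaced by an index structure, which is the role FAISS plays in the framework.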
Fakoya et al. studied this question.