Cross-modal retrieval (CMR) is the task of retrieving relevant information by exploiting semantic relationships between different modalities (text, image, or audio). The modality gap is the biggest challenge in CMR: each modality has different features and representations, so instances cannot be compared directly for retrieval. We therefore need to create a common sub-space in which the features of different modalities can be compared directly. CMR can have transformative applications in fields such as multimedia analysis, where it enables more accurate and efficient searching and comparison of images, videos, and text. In information retrieval, it enhances the ability to find relevant data across different formats, improving the user experience and search accuracy. For AI-driven content curation, CMR supports smarter, context-aware content recommendations by understanding and connecting information from various media types. Motivated by this potential impact, researchers have recently developed various deep learning-based methods that generate binary-valued or real-valued representations in the common sub-space. Current deep learning-based CMR methods generate the common sub-space using pairwise annotations during training. They learn an embedding that places similar instances closer together and dissimilar instances farther apart in the common sub-space. The efficiency of a CMR system depends on how effectively these embeddings are generated, so there is a need to generate effective embeddings for efficient CMR. In this work, a shared sub-space is established using triplet labels rather than pairwise labels. A triplet label consists of a query, a similar instance, and a dissimilar instance. Section 1 of the work focuses solely on how to generate triplet labels across modalities. The proposed method uses a long short-term memory (LSTM) network for the text modality and a convolutional neural network (CNN) for the image modality.
Given a query (text modality), the proposed method places similar instances (image modality) nearer to it and dissimilar instances (image modality) farther from it in the common sub-space. In addition to minimizing the inter-modal triplet loss, the proposed system also minimizes an intra-modal triplet loss for effective common sub-space generation. Experiments are performed to observe the impact of pairwise labels versus triplet labels on the generated common sub-space, using widely recognized datasets, including MSCOCO, Flickr8k, XMedia, and Wikipedia. This is the first attempt in which both binary-valued and real-valued representations are generated using triplet labels and their performance is compared. Experiments show that (a) the real-valued representation achieves a relative increase of 2.90% in mean average precision (mAP) over existing deep learning-based methods; (b) for both image→text and text→image retrieval tasks, there is a performance gap between the binary-valued and real-valued representations, with a relative margin of 1.12% mAP; and (c) the binary-valued representation requires less time for both training and testing than the real-valued representation. Hence, by exploring triplet labels, our goal is to increase the effectiveness and precision of CMR systems while pushing the limits of existing techniques.
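The combined objective described above can be illustrated with a minimal sketch. It is not the paper's implementation: the Euclidean distance, the hinge form with a `margin`, the `intra_weight` balancing factor, and the function names are all assumptions made for illustration, and the vectors stand in for embeddings that would come from the LSTM and CNN encoders.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(query, similar, dissimilar, margin=1.0):
    """Hinge-style triplet loss (assumed form).

    The loss is zero once the dissimilar instance is at least `margin`
    farther from the query than the similar instance.
    """
    return max(0.0, euclidean(query, similar) - euclidean(query, dissimilar) + margin)


def combined_loss(text_query, img_similar, img_dissimilar,
                  text_similar, text_dissimilar,
                  margin=1.0, intra_weight=1.0):
    """Inter-modal (text query vs. images) plus intra-modal (text vs. text) loss.

    `intra_weight` is a hypothetical balancing factor, not taken from the source.
    """
    inter = triplet_loss(text_query, img_similar, img_dissimilar, margin)
    intra = triplet_loss(text_query, text_similar, text_dissimilar, margin)
    return inter + intra_weight * intra


# Toy 2-D embeddings in the common sub-space (hypothetical values).
text_query = [0.0, 0.0]
loss = combined_loss(
    text_query,
    img_similar=[0.0, 1.0],      # distance 1 from the query
    img_dissimilar=[3.0, 4.0],   # distance 5 from the query
    text_similar=[1.0, 0.0],     # distance 1 from the query
    text_dissimilar=[0.0, 1.5],  # distance 1.5 from the query
)
```

In this toy case the inter-modal term is already zero because the dissimilar image lies well beyond the margin, while the intra-modal term still contributes because the dissimilar text is only 0.5 farther from the query than the similar text.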
Bhatt et al. (Sun,) studied this question.