Cross-modal retrieval is an important task at the intersection of computer vision and natural language processing. Existing multi-granularity alignment methods have made significant progress by globally aligning images with sentences or locally aligning regions with words. However, most of these methods rely on attention mechanisms, which can be affected by noise and cannot balance the differences between modalities, resulting in suboptimal image-text alignment. In addition, training with attention mechanisms consumes substantial computational resources, and the retrieval process is time-consuming. To address these challenges, we develop a novel multi-granularity image-text alignment model, which we call DFVLM. First, we train the intra-modal encoder and the cross-modal encoder independently to reduce interference between the encoders and to better learn intra-modal and inter-modal information. Then, we propose a joint strategy that combines the intra-modal and inter-modal information to better capture the information shared between image and text. In addition, we introduce a hard negative pair-based method to train the intra-modal multi-granularity encoder without using attention mechanisms. Unlike existing methods, we treat image and text retrieval as a two-way process and delve into the intrinsic connections between the two matrices. We propose a cross-validation method that optimizes the retrieval result based on common image-to-text alignment scenarios in daily life: during image-to-text retrieval, a text-to-image reverse verification is performed on the retrieved text. Extensive qualitative experiments and analysis show that our approach performs well on both the Flickr30K and MSCOCO datasets and significantly reduces the time required for testing.
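The reverse-verification idea can be illustrated with a minimal sketch. The abstract does not specify the exact re-ranking rule, so everything below is an assumption made for illustration: the function name `rerank_with_reverse_check`, the toy similarity matrix `SIM`, and the rule itself (order the top-k retrieved texts by how highly each text ranks the query image in the text-to-image direction, breaking ties by forward similarity). It should not be read as the DFVLM implementation.

```python
def rerank_with_reverse_check(sim, image_idx, k=3):
    """Re-rank the top-k texts for one image using a text-to-image check.

    sim[i][j] is the similarity between image i and text j (hypothetical input).
    """
    n_images = len(sim)
    row = sim[image_idx]

    # Forward pass (image-to-text): indices of the k most similar texts.
    candidates = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]

    def reverse_rank(text_idx):
        # Reverse pass (text-to-image): count the images that this text
        # scores strictly higher than the query image. 0 means the query
        # image is the text's best match.
        target = sim[image_idx][text_idx]
        return sum(1 for i in range(n_images) if sim[i][text_idx] > target)

    # Assumed rule: texts that rank the query image highly come first;
    # ties are broken by the forward similarity.
    return sorted(candidates, key=lambda j: (reverse_rank(j), -row[j]))


# Toy similarity matrix: rows are images, columns are texts.
SIM = [
    [0.80, 0.75, 0.10],
    [0.95, 0.20, 0.30],
    [0.10, 0.30, 0.90],
]

# Forward retrieval alone would return [0, 1] for image 0, but text 0
# matches image 1 better than image 0, so the reverse check demotes it.
print(rerank_with_reverse_check(SIM, 0, k=2))  # prints [1, 0]
```

In this sketch the reverse pass only reorders the k candidates already retrieved, so its extra cost per query is proportional to k times the number of images.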
Han et al. (Thu,) studied this question.