In the image-text matching task, the key to good matching quality is capturing the rich contextual dependencies between fragments of images and text. However, previous works either simply aggregate the similarity of all possible pairs of image regions and words, or apply multi-step cross attention in which image regions and words attend to each other as context, which requires exhaustive similarity computation between all image region and word pairs. In this paper, we propose Self-Attention Embeddings (SAEM) to exploit fragment relations in images or texts through a self-attention mechanism and to aggregate fragment information into visual and textual embeddings. Specifically, SAEM extracts salient image regions based on bottom-up attention and takes WordPiece tokens as sentence fragments. Self-attention layers, each consisting of a multi-head self-attention sub-layer and a position-wise feed-forward network sub-layer, are built to model subtle, fine-grained fragment relations in images and texts respectively. Consequently, the fragment self-attention mechanism can discover fragment relations, identify the semantically salient regions in images and words in sentences, and capture their interactions more accurately. By simultaneously exploiting fine-grained fragment relations in both the visual and textual modalities, our method produces more semantically consistent embeddings for representing images and texts, and demonstrates promising image-text matching accuracy and high efficiency on the Flickr30K and MSCOCO datasets.
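The idea of encoding each modality separately with self-attention and then comparing one embedding per image and per sentence can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single attention head with identity query/key/value projections, omits the feed-forward sub-layer, and uses mean pooling with L2 normalization as a stand-in for the aggregation step. All function names (`self_attention`, `embed`, `similarity`) are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(fragments):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the fragment vectors
    themselves (identity projections); a trained model would learn
    separate projections and use several heads.
    """
    dim = len(fragments[0])
    scale = math.sqrt(dim)
    attended = []
    for query in fragments:
        weights = softmax([dot(query, key) / scale for key in fragments])
        attended.append(
            [sum(w * value[j] for w, value in zip(weights, fragments))
             for j in range(dim)]
        )
    return attended


def embed(fragments):
    """Aggregate attended fragments into one unit-length embedding.

    Assumption: mean pooling followed by L2 normalization.
    """
    attended = self_attention(fragments)
    dim = len(attended[0])
    pooled = [sum(f[j] for f in attended) / len(attended) for j in range(dim)]
    norm = math.sqrt(dot(pooled, pooled)) or 1.0
    return [x / norm for x in pooled]


def similarity(image_fragments, text_fragments):
    """Cosine similarity between the image and sentence embeddings."""
    return dot(embed(image_fragments), embed(text_fragments))
```

In this sketch, attention is computed only within a modality, so matching an image against a sentence costs one dot product between two embeddings rather than a similarity computation over every region-word pair. Calling `similarity(regions, words)` with two lists of equal-dimension fragment vectors returns a score in [-1, 1].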
Yiling Wu
National Chung Cheng University
Shuhui Wang
Yunnan Agricultural University
Guoli Song
Peng Cheng Laboratory
Chinese Academy of Sciences
University of Chinese Academy of Sciences
Wu et al. (Tue,) studied this question.
synapsesocial.com/papers/6a15895ea2f71238514e847c — DOI: https://doi.org/10.1145/3343031.3350940