October 15, 2019

Learning Fragment Self-Attention Embeddings for Image-Text Matching

Yiling Wu (National Chung Cheng University), Shuhui Wang (Yunnan Agricultural University), Guoli Song (Peng Cheng Laboratory)

Abstract

In the image-text matching task, the key to good matching quality is capturing the rich contextual dependencies between fragments of images and text. However, previous works either simply aggregate the similarity of all possible pairs of image regions and words, or use multi-step cross attention in which image regions and words attend to each other as context, which requires exhaustive similarity computation between all image region and word pairs. In this paper, we propose Self-Attention Embeddings (SAEM) to exploit fragment relations in images or texts through a self-attention mechanism and to aggregate fragment information into visual and textual embeddings. Specifically, SAEM extracts salient image regions based on bottom-up attention and takes WordPiece tokens as sentence fragments. Self-attention layers, each consisting of a multi-head self-attention sub-layer and a position-wise feed-forward network sub-layer, are built to model subtle and fine-grained fragment relations in images and text respectively. Consequently, the fragment self-attention mechanism can discover the fragment relations, identify the semantically salient regions in images or words in sentences, and capture their interaction more accurately. By simultaneously exploiting the fine-grained fragment relations in both the visual and textual modalities, our method produces more semantically consistent embeddings for representing images and texts, and demonstrates promising image-text matching accuracy and high efficiency on the Flickr30K and MSCOCO datasets.
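
The layer structure described in the abstract (a multi-head self-attention sub-layer followed by a position-wise feed-forward sub-layer, applied to the fragments of one modality) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The random weights, the dimensions, the mean pooling and L2 normalization used to aggregate fragments into one embedding, and all function names are assumptions made for illustration. Residual connections, layer normalization, and the output projection of a full Transformer layer are omitted for brevity.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention_head(frags, wq, wk, wv):
    """Scaled dot-product self-attention over one set of fragments."""
    q, k, v = matmul(frags, wq), matmul(frags, wk), matmul(frags, wv)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # fragment-to-fragment scores within one modality
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights

def multi_head_self_attention(frags, heads):
    """Run every head and concatenate the per-fragment outputs."""
    outputs = [self_attention_head(frags, *h)[0] for h in heads]
    return [sum((out[i] for out in outputs), []) for i in range(len(frags))]

def feed_forward(frags, w1, w2):
    """Position-wise feed-forward network: the same two layers for each fragment."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(frags, w1)]
    return matmul(hidden, w2)

def embed(frags, heads, w1, w2):
    """Aggregate fragments into one embedding (mean pooling is an assumption)."""
    out = feed_forward(multi_head_self_attention(frags, heads), w1, w2)
    pooled = [sum(row[j] for row in out) / len(out) for j in range(len(out[0]))]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]

def rand_matrix(rows, cols, rng):
    """Random weights standing in for learned parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, n_heads = 8, 2
d_head = d_model // n_heads
heads = [tuple(rand_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(n_heads)]
w1 = rand_matrix(d_model, 16, rng)
w2 = rand_matrix(16, d_model, rng)
regions = rand_matrix(5, d_model, rng)  # 5 hypothetical image-region features
embedding = embed(regions, heads, w1, w2)
```

In this sketch attention is computed only among the fragments of a single image or sentence, so each embedding can be produced once and then compared with embeddings from the other modality, for example by a dot product, without computing similarities between every region and word pair.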
