May 15, 2019 · Open Access

Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations

Abstract

In vision-and-language grounding problems, fine-grained representations of the image are considered to be of paramount importance. Most current systems incorporate visual features and textual concepts as a sketch of an image. However, plainly inferred representations are usually undesirable in that they are composed of separate components, the relations of which are elusive. In this work, we aim at representing an image with a set of integrated visual regions and corresponding textual concepts, reflecting certain semantics. To this end, we build the Mutual Iterative Attention (MIA) module, which integrates correlated visual features and textual concepts, respectively, by aligning the two modalities. We evaluate the proposed approach on two vision-and-language grounding tasks, i.e., image captioning and visual question answering. In both tasks, the semantic-grounded image representations consistently boost the performance of the baseline models under all metrics across the board. The results demonstrate that our approach is effective and generalizes well to a wide range of models for image-related applications. (The code is available at https://github.com/fenglinliu98/MIA)
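The idea of aligning two modalities through alternating attention can be illustrated with a small sketch. The code below is not the MIA implementation from the paper or its repository; it is a simplified illustration using plain scaled dot-product attention, and the function names (`attend`, `mutual_iterative_attention`), the `steps` parameter, and the final element-wise sum are all assumptions made for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Scaled dot-product attention.

    Each query is replaced by a weighted average of the rows in
    keys_values, weighted by query-key similarity.
    """
    d = len(keys_values[0])
    out = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            for k in keys_values
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * kv[j] for w, kv in zip(weights, keys_values))
             for j in range(d)]
        )
    return out


def mutual_iterative_attention(visual, textual, steps=2):
    """Hypothetical simplification: alternate cross-attention between sets.

    visual:  list of region feature vectors (each of length d)
    textual: list of concept embedding vectors (each of length d)
    """
    for _ in range(steps):
        # Each textual concept gathers the visual regions it correlates with.
        new_visual = attend(textual, visual)
        # Each gathered visual vector then gathers correlated concepts.
        new_textual = attend(new_visual, textual)
        visual, textual = new_visual, new_textual
    # Assumed merge step: sum the aligned pairs into one vector per slot.
    return [[v + t for v, t in zip(vr, tr)] for vr, tr in zip(visual, textual)]


if __name__ == "__main__":
    regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # 3 toy visual regions
    concepts = [[1.0, 0.2], [0.1, 1.0]]              # 2 toy textual concepts
    fused = mutual_iterative_attention(regions, concepts)
    print(len(fused), len(fused[0]))
```

In this sketch the output has one fused vector per textual concept, because the concepts act as the queries in the first attention pass. A real model would typically add learned projections, multiple heads, residual connections, and normalization, none of which are shown here.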
