In vision-and-language grounding problems, fine-grained representations of the image are considered to be of paramount importance. Most current systems incorporate visual features and textual concepts as a sketch of an image. However, plainly inferred representations are usually undesirable in that they are composed of separate components, the relations of which are elusive. In this work, we aim at representing an image with a set of integrated visual regions and corresponding textual concepts, reflecting certain semantics. To this end, we build the Mutual Iterative Attention (MIA) module, which integrates correlated visual features and textual concepts, respectively, by aligning the two modalities. We evaluate the proposed approach on two vision-and-language grounding tasks, i.e., image captioning and visual question answering. In both tasks, the semantic-grounded image representations consistently boost the performance of the baseline models under all metrics across the board. The results demonstrate that our approach is effective and generalizes well to a wide range of models for image-related applications. (The code is available at https://github.com/fenglinliu98/MIA)
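As a rough illustration of the idea of aligning two modalities through mutual attention, the following is a minimal sketch and not the authors' implementation (their code is at the repository linked above). It assumes plain scaled dot-product attention with no learned projections, residual connections, or normalization; the function names `attend` and `mutual_iterative_attention` and the parameter `num_iters` are hypothetical and chosen only for this example. In each round, textual concepts query the visual features to form one integrated region per concept, and those regions then query the textual concepts, so the two outputs are paired row by row.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, sources):
    """Scaled dot-product attention without learned weights (an assumption).

    Each query vector is replaced by a weighted average of the source
    vectors, weighted by the softmax of query-source dot products.
    """
    dim = len(sources[0])
    scale = math.sqrt(dim)
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, s)) / scale for s in sources]
        weights = softmax(scores)
        out.append(
            [sum(w * s[d] for w, s in zip(weights, sources)) for d in range(dim)]
        )
    return out


def mutual_iterative_attention(visual, textual, num_iters=2):
    """Hypothetical sketch: alternate attention between the two modalities.

    visual:  list of region feature vectors
    textual: list of concept embedding vectors (same dimensionality)
    Returns two lists of equal length whose rows are paired.
    """
    for _ in range(num_iters):
        # Textual concepts gather the visual features correlated with them.
        visual = attend(textual, visual)
        # The integrated regions gather the textual concepts correlated with them.
        textual = attend(visual, textual)
    return visual, textual


if __name__ == "__main__":
    # Toy inputs: three region features and two concept embeddings in 2-D.
    regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    concepts = [[1.0, 0.0], [0.0, 1.0]]
    aligned_regions, aligned_concepts = mutual_iterative_attention(regions, concepts)
    print(len(aligned_regions), len(aligned_concepts))
```

Because every output row is a weighted average of input rows, the sketch only mixes existing features. A trained module would typically add learned projections and feed-forward layers, which are omitted here.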
Liu et al. studied this question.