October 17, 2021

Deconfounded and Explainable Interactive Vision-Language Retrieval of Complex Scenes

Abstract

In vision-language retrieval systems, users provide natural language feedback to find target images. Vision-language explanations in the systems can better guide users to provide feedback and thus improve the retrieval. However, developing explainable vision-language retrieval systems can be challenging, due to limited labeled multimodal data. In the retrieval of complex scenes, the issue of limited labeled data can be more severe. With multiple objects in the complex scenes, each user query may not exhaustively describe all objects in the desired image and thus more labeled queries are needed. The issue of limited labeled data can cause data selection biases, and result in spurious correlations learned by the models. When learning spurious correlations, existing explainable models may not be able to accurately extract regions from images and keywords from user queries.

Cite This Study

Wu et al. (2021) studied this question.

synapsesocial.com/papers/6a092973266340834eb62b0b
https://doi.org/10.1145/3474085.3475366
