In vision-language retrieval systems, users provide natural language feedback to find target images. Vision-language explanations in these systems can better guide users in giving feedback and thus improve retrieval. However, developing explainable vision-language retrieval systems is challenging because labeled multimodal data is limited. The problem is more severe when retrieving complex scenes: because such scenes contain multiple objects, a single user query may not exhaustively describe every object in the desired image, so more labeled queries are needed. Limited labeled data can introduce data selection biases and cause models to learn spurious correlations. When models learn spurious correlations, existing explainable models may fail to accurately extract the relevant regions from images and keywords from user queries.
Wu et al. (Sun,) studied this question.