Visual Question Answering (VQA) is a task that requires models to comprehend both questions and images. A growing number of works leverage the strong reasoning capabilities of Large Language Models (LLMs) to address VQA. These methods typically use image captions as textual descriptions of the visual content to help LLMs comprehend images. However, such captions often overlook the relations between fine-grained objects, which limits the reasoning capability of LLMs. In this paper, we present PFVR, a modular framework that Prompts LLMs with Fine-grained Visual Relationships for VQA. PFVR primarily consists of an answer-guided generation module (AGG) and a question-guided filtering module (QGF). Together, the two modules extract fine-grained visual relations from a scene graph, and these relations then serve as crucial context for LLMs to comprehend the image. Extensive experiments on the popular VQA dataset GQA show that PFVR achieves state-of-the-art results compared with other strong VQA competitors, demonstrating its effectiveness.
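The abstract does not specify how AGG or QGF work internally, so the following is only a minimal sketch of the general idea of prompting an LLM with question-relevant scene-graph relations. It assumes a scene graph given as (subject, relation, object) triples and uses a simple word-overlap filter as a stand-in for question-guided filtering; it does not attempt to model AGG. The names `SCENE_GRAPH`, `filter_relations`, and `build_prompt` are hypothetical and do not come from the paper.

```python
import re

# Hypothetical scene graph: (subject, relation, object) triples.
SCENE_GRAPH = [
    ("man", "holding", "umbrella"),
    ("umbrella", "above", "dog"),
    ("car", "parked on", "street"),
]


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def filter_relations(question, triples):
    """Keep triples whose subject or object shares a word with the question.

    This word-overlap rule is an assumption for illustration only,
    not the filtering criterion used by PFVR.
    """
    q_words = tokenize(question)
    return [
        t for t in triples
        if tokenize(t[0]) & q_words or tokenize(t[2]) & q_words
    ]


def build_prompt(caption, question, triples):
    """Assemble a text prompt from the caption, kept relations, and question."""
    relations = "\n".join(f"- {s} {r} {o}" for s, r, o in triples)
    return (
        f"Caption: {caption}\n"
        f"Relations:\n{relations}\n"
        f"Question: {question}\n"
        f"Answer:"
    )


if __name__ == "__main__":
    question = "What is the man holding?"
    kept = filter_relations(question, SCENE_GRAPH)
    print(build_prompt("A man walks a dog in the rain.", question, kept))
```

In this sketch only the relation involving "man" survives the filter, so the prompt carries the relation the question needs rather than the whole scene graph.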
Jiapeng Liu
Chengyang Fang
Liang Li
Chinese Academy of Sciences
University of Chinese Academy of Sciences
Institute of Information Engineering
Liu et al. studied this question.
www.synapsesocial.com/papers/68e7397eb6db6435876b2a28 — DOI: https://doi.org/10.1109/icassp48485.2024.10448321