March 18, 2024 · Open Access

Prompting Large Language Models with Fine-Grained Visual Relations from Scene Graph for Visual Question Answering

Abstract

Visual Question Answering (VQA) is a task that requires models to comprehend both questions and images. A growing number of works leverage the strong reasoning capabilities of Large Language Models (LLMs) to address VQA. These methods typically use image captions as textual descriptions of the visual content to help LLMs comprehend images. However, such captions often overlook the relations among fine-grained objects, which limits the reasoning capability of LLMs. In this paper, we present PFVR, a modular framework that Prompts LLMs with Fine-grained Visual Relationships for VQA. PFVR primarily consists of an answer-guided generation module (AGG) and a question-guided filtering module (QGF). Together, the two modules extract fine-grained visual relations from the scene graph, which then serve as crucial context for LLMs to comprehend the image. Extensive experiments on the popular VQA dataset GQA show that PFVR achieves state-of-the-art results compared with other strong VQA competitors, demonstrating its effectiveness.
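To make the general idea of prompting an LLM with scene-graph relations concrete, the following is a minimal sketch, not PFVR's actual AGG or QGF modules, whose internals the abstract does not describe. It assumes a scene graph given as (subject, relation, object) triples and uses simple keyword overlap with the question as a stand-in for question-guided filtering. The data, the stopword list, and the function names `keywords`, `filter_relations`, and `build_prompt` are all hypothetical.

```python
import re

# Hypothetical scene graph as (subject, relation, object) triples.
# Illustrative data only; not taken from GQA or the paper.
SCENE_GRAPH = [
    ("man", "holding", "umbrella"),
    ("umbrella", "above", "dog"),
    ("car", "parked on", "street"),
]

# Small illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "a", "an", "is", "what", "of", "in", "on", "to"}


def keywords(question):
    """Lowercase the question and return its non-stopword tokens as a set."""
    tokens = re.findall(r"[a-z]+", question.lower())
    return {t for t in tokens if t not in STOPWORDS}


def filter_relations(triples, question):
    """Keep triples whose subject or object appears in the question.

    Keyword overlap is a simple stand-in for question-guided filtering.
    """
    kws = keywords(question)
    return [t for t in triples if kws & {t[0], t[2]}]


def build_prompt(triples, question):
    """Render the kept triples as short sentences and prepend them as context."""
    context = " ".join(f"{s} {r} {o}." for s, r, o in triples)
    return f"Context: {context}\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    q = "What is the man holding?"
    print(build_prompt(filter_relations(SCENE_GRAPH, q), q))
```

The resulting string would be passed to an LLM in place of, or alongside, an image caption. A learned filter would replace the keyword overlap in practice.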
