March 5, 2026 | Open Access

KGLMQA: enhancing medical visual question answering with knowledge graphs and LLMs

Key Points

The study aims to improve medical visual question answering (MedVQA) by addressing three limitations of existing models: restricted multimodal interaction, insufficient guidance from external medical knowledge, and weak diagnostic logic.
The authors developed KGLMQA, a framework that combines knowledge graphs with large language models (LLMs).
The framework includes a MedVQA classification model built on gating mechanisms and multi-stage feature fusion.
A Knowledge Graph Retrieval Augmented Generation (KGRAG) module dynamically retrieves and refines structured medical knowledge.
KGLMQA achieved state-of-the-art Accuracy and Precision on the P-VQA, VQA-RAD, and SLAKE datasets.
It outperformed the fully fine-tuned LLaVA-Med model on open-ended questions in the VQA-RAD dataset.
It showed strong robustness to upstream visual noise and surpassed GPT-4o in diagnostic logicality.

Abstract

Medical Visual Question Answering (MedVQA) leverages computer vision and natural language processing techniques to assist in clinical decision-making. However, existing models frequently encounter challenges such as restricted multimodal interaction, insufficient guidance from external medical knowledge, and a lack of rigorous diagnostic logic in their responses. To address these issues, we propose KGLMQA, a novel framework that integrates knowledge graphs with Large Language Models (LLMs). The framework comprises three core components: a high-precision MedVQA classification model utilizing gating mechanisms and multi-stage feature fusion; a Knowledge Graph Retrieval Augmented Generation (KGRAG) module for dynamically retrieving and refining structured medical knowledge; and an LLM that generates professional responses based on structured prompts. Experimental results on the Patient-oriented Visual Question Answering (P-VQA), Visual Question Answering in Radiology (VQA-RAD), and Semantically-Labeled Knowledge-Enhanced (SLAKE) datasets demonstrate that KGLMQA achieves state-of-the-art performance in metrics such as Accuracy and Precision. Notably, it outperforms the fully fine-tuned Large Language-and-Vision Assistant for Biomedicine (LLaVA-Med) model in handling open-ended questions on the VQA-RAD dataset. Further error propagation analysis and comparative evaluation with Generative Pre-trained Transformer 4o (GPT-4o) reveal that KGLMQA not only exhibits strong robustness against upstream visual noise but also surpasses GPT-4o in terms of diagnostic logicality. These findings indicate that integrating visual diagnostic cues and explicit structured knowledge with LLMs significantly enhances the interpretability and clinical application potential of MedVQA systems.
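The three-stage flow described in the abstract (gated feature fusion, knowledge graph retrieval, structured prompting) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the element-wise sigmoid gate, the exact-match retrieval rule, the prompt layout, and the toy triples are all assumptions made for illustration, since the abstract does not specify any of them.

```python
import math
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(image_feat: List[float],
                 text_feat: List[float],
                 gate_logits: List[float]) -> List[float]:
    """Assumed gate form: fused = g * image + (1 - g) * text, with g = sigmoid(logit).

    In a trained model the gate logits would be learned; here they are inputs.
    """
    fused = []
    for v, t, z in zip(image_feat, text_feat, gate_logits):
        g = sigmoid(z)
        fused.append(g * v + (1.0 - g) * t)
    return fused


# Toy knowledge graph, hypothetical and for illustration only.
TOY_KG: List[Triple] = [
    ("pneumonia", "has_symptom", "cough"),
    ("pneumonia", "imaging_finding", "lung consolidation"),
    ("pneumothorax", "imaging_finding", "absent lung markings"),
]


def retrieve_triples(kg: List[Triple], entity: str,
                     max_triples: int = 5) -> List[Triple]:
    """Assumed retrieval rule: keep triples whose head equals the predicted entity."""
    hits = [t for t in kg if t[0] == entity.lower()]
    return hits[:max_triples]


def build_prompt(question: str, predicted_label: str,
                 triples: List[Triple]) -> str:
    """Assemble a structured prompt (hypothetical layout) for a downstream LLM."""
    facts = "\n".join(f"- {h} {r.replace('_', ' ')} {t}" for h, r, t in triples)
    return (
        f"Question: {question}\n"
        f"Visual classifier prediction: {predicted_label}\n"
        f"Retrieved knowledge:\n{facts if facts else '- (none)'}\n"
        "Answer using only the prediction and the retrieved knowledge."
    )


if __name__ == "__main__":
    label = "Pneumonia"  # stand-in for the classifier's output
    prompt = build_prompt("What abnormality is shown?", label,
                          retrieve_triples(TOY_KG, label))
    print(prompt)
```

In this sketch the classifier's predicted label is the only link between the visual stage and the knowledge graph, so an upstream misclassification retrieves the wrong facts. That dependency is presumably what the error propagation analysis mentioned in the abstract examines.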
