Knowledge-driven Visual Question Answering (VQA) requires combining an image's visual content with external information to produce accurate and contextually appropriate answers. Although Large Language Models (LLMs) show considerable promise in this area, their lack of structured reasoning and their restricted access to specialized information limit their effectiveness, especially in specific domains such as medical diagnostics and patient care. In this study, we introduce a versatile, resilient, and domain-independent framework that improves LLM-powered VQA systems by incorporating structured reasoning and external knowledge. Our system uses ResNet50 for image feature extraction and FLAN-T5 for language-driven question answering, and integrates them with a reasoning module to enhance accuracy. ResNet50 was chosen for its dependable efficiency and low computational demands, and FLAN-T5 for its robust reasoning at lower complexity than larger models; together they balance performance and computational cost better than more intricate models such as ViT or GPT-4. In contrast to conventional end-to-end fine-tuning methods, our framework integrates smoothly with both open-source and commercial LLMs, lowering computational expense while preserving high accuracy in zero-shot and few-shot settings. By combining multi-query ensemble techniques, context-sensitive feature selection, and the retrieval of external domain knowledge, our system improves explainability and reliability, making it especially appropriate for medical VQA applications. The integration of ResNet50 for image comprehension, FLAN-T5 for reasoning, and direction-guided prompts that incorporate structured knowledge more efficiently provides a scalable and effective solution for real-time, knowledge-driven VQA systems.
The proposed approach improves accuracy by 7% and reduces response time by 30% relative to baseline techniques on the OKVQA dataset.
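The abstract does not give implementation details, so the following is only a minimal sketch of how a multi-query ensemble with retrieved knowledge might be wired together. All names here (`build_prompts`, `normalize`, `ensemble_answer`, `answer_fn`) are hypothetical, the prompt templates are invented for illustration, and `caption` is assumed to be a text description derived from the image features. In a real system `answer_fn` would presumably wrap a FLAN-T5 generation call, and the authors' actual voting scheme may differ from the simple majority vote shown.

```python
from collections import Counter
from typing import Callable, List


def build_prompts(question: str, caption: str, facts: List[str]) -> List[str]:
    """Build several phrasings of one query (the assumed 'multi-query' step).

    `caption` stands in for a textual summary of the image features and
    `facts` for retrieved external knowledge; both are assumptions.
    """
    knowledge = " ".join(facts)
    return [
        f"Context: {caption} Knowledge: {knowledge} Question: {question} Answer:",
        f"{knowledge} Given an image showing {caption}, answer briefly: {question}",
        f"Question: {question} Image: {caption} Facts: {knowledge} Short answer:",
    ]


def normalize(answer: str) -> str:
    """Lowercase and strip whitespace and trailing periods so equal answers match."""
    return answer.strip().lower().rstrip(".")


def ensemble_answer(prompts: List[str], answer_fn: Callable[[str], str]) -> str:
    """Query the model once per prompt and return the most frequent answer."""
    votes = Counter(normalize(answer_fn(p)) for p in prompts)
    return votes.most_common(1)[0][0]
```

Because the language model is passed in as a plain callable, the same voting logic could sit in front of an open-source or a commercial LLM without retraining, which is in the spirit of the plug-in design the abstract describes.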
Noorbhasha Junnubabu
K. Geethanjali
B. Bhuvaneswari
Junnubabu et al. studied this question.
www.synapsesocial.com/papers/69be38216e48c4981c678469 — DOI: https://doi.org/10.1051/matecconf/202641901019/pdf