What question did this study set out to answer?

The aim is to enhance visual question answering systems by integrating structured reasoning and external knowledge.

March 21, 2026

Knowledge-Based Visual Question Answering System Using Multimodal Deep Learning

Key points

The aim is to enhance visual question answering systems by integrating structured reasoning and external knowledge.
Introduced a framework combining ResNet50 for image feature extraction and FLAN-T5 for language processing.
Utilized multi-query ensemble techniques and context-sensitive feature selection.
Enabled integration with open-source and commercial large language models to lower computational costs.
Achieved a 7% increase in accuracy compared to baseline methods.
Reduced response time by 30% on the OKVQA dataset.
Improved explainability and reliability for medical applications.

Abstract

Knowledge-driven Visual Question Answering (VQA) requires combining external information with an image's visual content to produce accurate, contextually appropriate answers. Although Large Language Models (LLMs) show considerable promise in this area, their lack of structured reasoning and limited access to specialized information restrict their effectiveness, especially in domains such as medical diagnostics and patient care. In this study, we introduce a versatile, robust, and domain-independent framework that improves LLM-powered VQA systems by incorporating structured reasoning and external knowledge. Our system uses ResNet50 for image feature extraction and FLAN-T5 for language-driven question answering, and couples them with a reasoning module to improve accuracy. ResNet50 was chosen for its dependable performance and low computational demands, and FLAN-T5 for its robust reasoning at lower complexity than larger models; together they offer an effective balance of performance and computational efficiency compared with more complex models such as ViT or GPT-4. Unlike conventional end-to-end fine-tuning methods, our framework integrates with both open-source and commercial LLMs, lowering computational cost while preserving high accuracy in zero-shot and few-shot settings. By combining multi-query ensemble techniques, context-sensitive feature selection, and retrieval of external domain knowledge, the system improves explainability and reliability, making it especially suitable for medical VQA applications. The combination of ResNet50 for image understanding, FLAN-T5 for reasoning, and direction-guided prompts that integrate structured knowledge yields a scalable solution for real-time, knowledge-driven VQA systems.
The proposed approach improves accuracy by 7% and reduces response time by 30% relative to baseline techniques on the OKVQA dataset.
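
The abstract names multi-query ensembling over retrieved knowledge but does not say how the individual answers are combined. The sketch below shows one common option, a simple majority vote, and is only an illustration under that assumption, not the authors' method. The function names (`build_prompts`, `ensemble_answer`), the prompt template, and the `answer_fn` callable (a stand-in for a call to FLAN-T5 or any other LLM) are all hypothetical.

```python
from __future__ import annotations

from collections import Counter
from typing import Callable, Sequence


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so equivalent answers vote together."""
    return " ".join(answer.lower().split())


def build_prompts(question: str, knowledge: Sequence[str]) -> list[str]:
    """Build one prompt per retrieved knowledge snippet.

    The template is an assumption made for illustration; the source does not
    specify a prompt format.
    """
    return [f"Context: {fact}\nQuestion: {question}\nAnswer:" for fact in knowledge]


def ensemble_answer(
    prompts: Sequence[str], answer_fn: Callable[[str], str]
) -> tuple[str, float]:
    """Query the model once per prompt and return the majority answer.

    Returns the winning (normalized) answer and the share of prompts that
    voted for it. Majority voting is assumed here, not taken from the source.
    """
    if not prompts:
        raise ValueError("at least one prompt is required")
    votes = Counter(normalize(answer_fn(p)) for p in prompts)
    answer, count = votes.most_common(1)[0]
    return answer, count / len(prompts)
```

In use, `answer_fn` would wrap whichever language model is plugged into the framework, and the returned vote share could serve as a rough agreement signal across the queries.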
