What question did this study set out to answer?

This research aims to enhance accuracy in medical visual question answering by effectively utilizing local imaging details and question semantics.

June 26, 2026 | Open Access

MVE-QGCAF: Multi-Grained visual evidence construction and question-guided gated cross-attention fusion for medical visual question answering

Key Points

Proposed a framework (MVE-QGCAF) combining Multi-grained Visual Evidence Construction and Question-guided Gated Cross-Attention Fusion.
Constructed a global-local visual representation by selecting question-conditioned regions of interest from automatically generated candidate regions while retaining whole-image context.
Evaluated on the VQA-RAD, SLAKE, and VQA-Med-2019 datasets, with ablation studies and gating-weight visualization.
Achieved accuracies of 80.50±0.21%, 89.22±0.20%, and 82.17±0.11% on VQA-RAD, SLAKE, and VQA-Med-2019, respectively.
Showed through ablation and supporting analyses that the gains stem mainly from question-conditioned evidence selection and adaptive local evidence calibration, rather than from simply adding local image crops.

Abstract

Medical Visual Question Answering (MedVQA) aims to generate accurate answers to clinical questions based on medical images. Although recent generative vision-language models have improved open-ended answer generation, two key challenges remain insufficiently addressed. First, many medical questions depend on small lesions, organ-level structures, or subtle imaging findings, which may be weakened when the model relies only on whole-image representations. Second, question semantics are not always effectively used to determine which local regions are relevant and how much they should contribute to answer generation. To address these issues, we propose MVE-QGCAF, a generative MedVQA framework based on Multi-grained Visual Evidence Construction (MVE) and Question-guided Gated Cross-Attention Fusion (QGCAF). MVE constructs a global-local visual representation by selecting question-conditioned regions of interest from automatically generated candidate medical regions while retaining the whole-image context. QGCAF further decomposes multimodal fusion into two sequential operations: cross-attention-based retrieval of multi-grained visual evidence and question-guided gating for local evidence contribution calibration. The calibrated visual representation is then projected into the input space of a generative language model for concise medical answer generation. Experiments on VQA-RAD, SLAKE, and VQA-Med-2019 show that MVE-QGCAF achieves overall accuracies of 80.50±0.21%, 89.22±0.20%, and 82.17±0.11%, respectively. Further ablation studies, local-region quality analysis, gating-weight visualization, computational cost analysis, and failure case analysis demonstrate that the performance gains mainly stem from question-conditioned evidence selection and adaptive local evidence calibration, rather than from the simple addition of local image crops. 
These results indicate that organizing and weighting local visual evidence according to question semantics is an effective strategy for generative MedVQA.
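The two stages described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the cosine-similarity region scoring, the mean-pooled question summary, the single-head attention, the sigmoid gate parameterized by `W_g` and `b_g`, and all function names are assumptions chosen for illustration. The projection into the language model's input space is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def select_regions(q_summary, candidates, k):
    """Assumed MVE-style step: keep the k candidate regions whose features
    are most similar (cosine) to the question summary vector."""
    qn = q_summary / np.linalg.norm(q_summary)
    cn = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = cn @ qn                      # (n_candidates,)
    top = np.argsort(-scores)[:k]         # indices of the k best regions
    return candidates[top], top

def gated_cross_attention(q_tokens, global_feat, local_feats, W_g, b_g):
    """Assumed QGCAF-style step.
    q_tokens: (T, d), global_feat: (d,), local_feats: (k, d)."""
    d = q_tokens.shape[1]
    evidence = np.vstack([global_feat[None, :], local_feats])    # (1+k, d)

    # Operation 1: cross-attention retrieval over global + local evidence.
    attn = softmax(q_tokens @ evidence.T / np.sqrt(d), axis=-1)  # (T, 1+k)
    global_part = attn[:, :1] @ evidence[:1]                     # (T, d)
    local_part = attn[:, 1:] @ evidence[1:]                      # (T, d)

    # Operation 2: a question-guided gate scales the local contribution.
    gate = sigmoid(W_g @ q_tokens.mean(axis=0) + b_g)            # (d,)
    fused = global_part + gate * local_part                      # (T, d)
    return fused, gate, attn

# Toy usage with random features (all sizes are arbitrary).
rng = np.random.default_rng(0)
d, n_candidates, k, T = 16, 8, 3, 5
q_tokens = rng.normal(size=(T, d))
global_feat = rng.normal(size=d)
candidates = rng.normal(size=(n_candidates, d))

selected, idx = select_regions(q_tokens.mean(axis=0), candidates, k)
W_g, b_g = 0.1 * rng.normal(size=(d, d)), np.zeros(d)
fused, gate, attn = gated_cross_attention(q_tokens, global_feat, selected, W_g, b_g)
print(fused.shape, gate.shape, attn.shape)
```

Keeping the global feature outside the gate mirrors the abstract's point that whole-image context is retained while only the local evidence contribution is calibrated; in this sketch a gate near zero falls back to the global pathway alone.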
