Medical Visual Question Answering (MedVQA) aims to generate accurate answers to clinical questions based on medical images. Although recent generative vision-language models have improved open-ended answer generation, two key challenges remain insufficiently addressed. First, many medical questions depend on small lesions, organ-level structures, or subtle imaging findings, which may be weakened when the model relies only on whole-image representations. Second, question semantics are not always effectively used to determine which local regions are relevant and how much they should contribute to answer generation. To address these issues, we propose MVE-QGCAF, a generative MedVQA framework based on Multi-grained Visual Evidence Construction (MVE) and Question-guided Gated Cross-Attention Fusion (QGCAF). MVE constructs a global-local visual representation by selecting question-conditioned regions of interest from automatically generated candidate medical regions while retaining the whole-image context. QGCAF further decomposes multimodal fusion into two sequential operations: cross-attention-based retrieval of multi-grained visual evidence and question-guided gating for local evidence contribution calibration. The calibrated visual representation is then projected into the input space of a generative language model for concise medical answer generation. Experiments on VQA-RAD, SLAKE, and VQA-Med-2019 show that MVE-QGCAF achieves overall accuracies of 80.50±0.21%, 89.22±0.20%, and 82.17±0.11%, respectively. Further ablation studies, local-region quality analysis, gating-weight visualization, computational cost analysis, and failure case analysis demonstrate that the performance gains mainly stem from question-conditioned evidence selection and adaptive local evidence calibration, rather than from the simple addition of local image crops. 
These results indicate that organizing and weighting local visual evidence according to question semantics is an effective strategy for generative MedVQA.
Yu et al. studied this question.
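The two-stage fusion described in the abstract (cross-attention retrieval of visual evidence, followed by question-guided gating of the local contribution) can be illustrated with a minimal sketch. This is not the authors' implementation: the single-head attention, the scalar sigmoid gate, the additive combination with the global token, and every name (`cross_attend`, `question_gate`, `fuse`, `gate_w`, `gate_b`) are assumptions chosen for illustration, and the toy vectors are made-up values rather than real features.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(query, tokens):
    """Single-head scaled dot-product attention: one question vector
    attends over a list of visual tokens (assumed form of 'retrieval')."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, t) / scale for t in tokens])
    dim = len(tokens[0])
    pooled = [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(dim)]
    return pooled, weights

def question_gate(query, gate_w, gate_b):
    """Scalar sigmoid gate in (0, 1) computed from the question alone
    (hypothetical parameterization of the 'question-guided gate')."""
    return 1.0 / (1.0 + math.exp(-(dot(query, gate_w) + gate_b)))

def fuse(query, global_token, local_tokens, gate_w, gate_b):
    # Stage 1: retrieve local evidence relevant to the question.
    local_evidence, weights = cross_attend(query, local_tokens)
    # Stage 2: calibrate how much the local evidence contributes.
    g = question_gate(query, gate_w, gate_b)
    # Whole-image context is always retained; local evidence is scaled by g.
    fused = [gv + g * lv for gv, lv in zip(global_token, local_evidence)]
    return fused, weights, g

# Toy example with 2-dimensional features (illustrative values only).
question = [1.0, 0.0]
global_token = [0.5, 0.5]
local_tokens = [[1.0, 0.0], [0.0, 1.0]]  # two candidate regions
gate_w, gate_b = [0.0, 0.0], 0.0         # untrained gate, so g = 0.5

fused, weights, g = fuse(question, global_token, local_tokens, gate_w, gate_b)
```

In this toy setting the first region aligns with the question vector, so it receives the larger attention weight, while the gate scales the pooled local evidence before it is added to the global token. In a trained model the gate parameters would be learned, and the fused vector would then be projected into the language model's input space as the abstract describes.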