What question did this study set out to answer?

This research aims to evaluate the effectiveness of vision-language models in medical visual question answering tasks, focusing on their performance on radiology images.

February 5, 2026 · Open Access

Systematic Analysis of Vision–Language Models for Medical Visual Question Answering

Key Points

Compared three vision-language models (ViLT, BLIP, MiniCPM-V-2) on medical visual question answering tasks.
Used standard datasets (SLAKE, OmniMedVQA-Mini) focusing on CT, MRI, and X-ray images.
Implemented zero-shot evaluation followed by supervised fine-tuning on modality-specific data.
Added a post-hoc semantic option-selection layer that maps free-text predictions to multiple-choice answers.
Zero-shot performance was low (exact match of about 20% for ViLT and BLIP, and 0% for MiniCPM-V-2), confirming that off-the-shelf models are insufficient.
Fine-tuning improved all models substantially: ViLT reached about 80% exact match and BLIP about 50%, while MiniCPM-V-2 lagged behind.
With option selection, ViLT and BLIP reached approximately 90–93% exact match and F1 across all modalities, corresponding to 95–97% BERTScore-F1.
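
For readers unfamiliar with the exact match and F1 figures quoted above, the following is a minimal sketch of how such scores are commonly computed for short free-text answers. It is not the study's evaluation code: the normalisation steps (lowercasing, dropping punctuation and English articles) and the function names `normalize`, `exact_match`, and `token_f1` are assumptions chosen for illustration, following the convention widely used in question-answering benchmarks.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumed normalisation; the study's own answer processing may differ.
    """
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalised strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers count as a match; one empty answer does not.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("The Lung.", "lung"))
    print(token_f1("left lung", "lung"))
```

Exact match rewards only fully correct answers, while token-level F1 gives partial credit when a prediction overlaps with the reference, which is why the two are often reported together.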

Abstract

General-purpose vision–language models (VLMs) are increasingly applied to imaging tasks, yet their reliability on medical visual question answering (Med-VQA) remains unclear. We investigate how three state-of-the-art VLMs—ViLT, BLIP, and MiniCPM-V-2—perform on radiology-focused Med-VQA when evaluated in a modality-aware manner. Using SLAKE and OmniMedVQA-Mini, we construct harmonised subsets for computed tomography (CT), magnetic resonance imaging (MRI), and X-ray, standardising schema and answer processing. We first benchmark all models in a strict zero-shot setting, then perform supervised fine-tuning on modality-specific data splits, and finally add a post-hoc semantic option-selection layer that maps free-text predictions to multiple-choice answers. Zero-shot performance is modest (exact match ≈20% for ViLT/BLIP and 0% for MiniCPM-V-2), confirming that off-the-shelf deployment is inadequate. Fine-tuning substantially improves all models, with ViLT reaching ≈80% exact match and BLIP ≈50%, while MiniCPM-V-2 lags behind. When coupled with option selection, ViLT and BLIP achieve 90–93% exact match and F1 across all modalities, corresponding to 95–97% BERTScore-F1. Our novel results show that (i) modality-specific supervision is essential for Med-VQA, and (ii) post-hoc option selection can transform strong but imperfect generative predictions into highly reliable discrete decisions on harmonised radiology benchmarks. The latter is useful for medical VLMs that combine generative responses with option or sentence selection.
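
The post-hoc option-selection step described in the abstract can be illustrated with a small sketch. The abstract does not say how semantic similarity is computed, so this example swaps in plain character-level similarity from Python's standard-library `difflib` as a stand-in; a semantic variant would presumably replace the `score` function with an embedding-based similarity. The function name `select_option` and the example options are hypothetical.

```python
from difflib import SequenceMatcher


def select_option(prediction: str, options: list[str]) -> str:
    """Map a free-text prediction to the most similar multiple-choice option.

    Uses lexical similarity (difflib) as an assumed stand-in for the
    semantic similarity mentioned in the abstract.
    """
    if not options:
        raise ValueError("options must not be empty")
    pred = prediction.strip().lower()

    def score(option: str) -> float:
        # ratio() is in [0, 1]; 1.0 means the two strings are identical.
        return SequenceMatcher(None, pred, option.strip().lower()).ratio()

    # max() returns the first option with the highest score on ties.
    return max(options, key=score)


if __name__ == "__main__":
    choices = ["Left lung", "Right kidney", "Liver", "Heart"]
    print(select_option("It looks like the left lung", choices))
```

Because the output is always one of the supplied options, a near-miss generative answer can still be scored as a correct discrete choice, which is consistent with the jump in exact match the abstract reports once option selection is added.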
