Objectives: Medical Visual Question Answering (MedVQA) systems are predominantly evaluated using exact-match accuracy, which fails to account for partially correct or clinically insightful answers, particularly in open-ended question settings. This study aims to investigate the effectiveness of language-based evaluation metrics, specifically BLEU scores at multiple n-gram levels, in providing a more nuanced perspective of model performance beyond strict accuracy.

Methods: We conducted a systematic benchmarking of multimodal vision–language models by combining three transformer-based visual encoders (ViT, BEiT, and DeiT) with three transformer-based text encoders (BERT, RoBERTa, and ALBERT). All model variants were fine-tuned and evaluated on the VQA-Med 2020 and VQA-Med 2021 datasets. Performance was assessed using exact-match accuracy alongside BLEU-1 to BLEU-4 and cumulative BLEU scores to capture varying degrees of lexical and semantic overlap between predicted and reference answers.

Results: Experimental results demonstrate that DeiT-based visual backbones consistently outperform ViT and BEiT across both datasets. On VQA-Med 2020, the DeiT + BERT model achieved the highest accuracy of 52.27% with a cumulative BLEU score of 26.87. On VQA-Med 2021, DeiT + RoBERTa reached the highest accuracy of 54.87%, while DeiT + ALBERT achieved the highest cumulative BLEU score of 25.50. Qualitative analyses further show that BLEU scores effectively capture partial correctness, synonymous terminology, and clinically relevant overlaps that are ignored by accuracy-based evaluation.

Conclusion: This study demonstrates that language-based evaluation metrics, particularly BLEU scores, provide important complementary insights beyond accuracy in open-ended MedVQA tasks. By capturing semantic similarity and partial correctness, such language-based metrics support a more clinically meaningful assessment of model outputs.
These findings emphasize the need for multi-metric evaluation frameworks, as exact-match accuracy alone is insufficient for open-ended responses.
Lameesa et al. (Sun,) studied this question.
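
The contrast between exact-match accuracy and BLEU described above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the tokenization (lowercased whitespace split), the absence of smoothing, the single-reference setup, and the function names (`exact_match`, `bleu`, and the helpers) are all assumptions made for illustration, and the example answer pair is invented rather than taken from VQA-Med.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: simple lowercased whitespace tokenization.
    return text.lower().split()


def ngrams(tokens, n):
    # All contiguous n-grams of the token list, as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(pred, ref, n):
    # Clipped n-gram precision: each predicted n-gram is credited at most
    # as many times as it occurs in the reference.
    pred_counts = Counter(ngrams(pred, n))
    ref_counts = Counter(ngrams(ref, n))
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(c, ref_counts[g]) for g, c in pred_counts.items())
    return clipped / total


def brevity_penalty(pred_len, ref_len):
    # Penalizes predictions shorter than the reference.
    if pred_len == 0:
        return 0.0
    if pred_len >= ref_len:
        return 1.0
    return math.exp(1 - ref_len / pred_len)


def bleu(prediction, reference, max_n):
    # Cumulative BLEU up to max_n with uniform weights and no smoothing
    # (an assumption; the score is 0 if any n-gram precision is 0).
    pred, ref = tokenize(prediction), tokenize(reference)
    precisions = [modified_precision(pred, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(pred), len(ref)) * math.exp(log_mean)


def exact_match(prediction, reference):
    # Strict accuracy for one answer: 1.0 only if the token sequences match.
    return float(tokenize(prediction) == tokenize(reference))


# Invented example: a partially correct answer.
reference = "pulmonary embolism"
prediction = "acute pulmonary embolism"
print(exact_match(prediction, reference))  # 0.0: no credit under exact match
print(bleu(prediction, reference, 1))      # 2 of 3 unigrams match
print(bleu(prediction, reference, 2))      # geometric mean of 2/3 and 1/2
```

In this sketch the prediction receives no credit under exact match but a non-zero BLEU-1 and cumulative BLEU-2, which is the kind of partial correctness the abstract refers to. Short answers often have no matching 3-grams or 4-grams, so an unsmoothed cumulative BLEU-4 would be zero here; a real evaluation would likely need a smoothing scheme, which this sketch omits.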