May 8, 2026 | Open Access

A linguistic lens into vision-language models for open-ended question-answers in medical visual question answering

Key Points

This study aims to explore language-based evaluation metrics for Medical Visual Question Answering systems to better capture model performance beyond exact-match accuracy.
Systematic benchmarking of multimodal vision-language models using three transformer-based visual encoders and three text encoders.
Models fine-tuned and evaluated on VQA-Med 2020 and VQA-Med 2021 datasets.
Performance assessed using exact-match accuracy alongside BLEU-1 to BLEU-4 scores.
DeiT-based visual backbones consistently outperform ViT and BEiT models on both datasets.
DeiT + BERT achieved 52.27% accuracy and a cumulative BLEU score of 26.87 on VQA-Med 2020.
On VQA-Med 2021, DeiT + RoBERTa reached 54.87% accuracy, while DeiT + ALBERT had the highest BLEU score of 25.50.

Abstract

Objectives: Medical Visual Question Answering (MedVQA) systems are predominantly evaluated using exact-match accuracy, which fails to account for partially correct or clinically insightful answers, particularly in open-ended question settings. This study investigates whether language-based evaluation metrics, specifically BLEU scores at multiple n-gram levels, provide a more nuanced view of model performance than strict accuracy.

Methods: We conducted a systematic benchmarking of multimodal vision–language models by combining three transformer-based visual encoders (ViT, BEiT, and DeiT) with three transformer-based text encoders (BERT, RoBERTa, and ALBERT). All model variants were fine-tuned and evaluated on the VQA-Med 2020 and VQA-Med 2021 datasets. Performance was assessed using exact-match accuracy alongside BLEU-1 to BLEU-4 and cumulative BLEU scores to capture varying degrees of lexical and semantic overlap between predicted and reference answers.

Results: DeiT-based visual backbones consistently outperform ViT and BEiT across both datasets. On VQA-Med 2020, the DeiT + BERT model achieved the highest accuracy of 52.27% with a cumulative BLEU score of 26.87. On VQA-Med 2021, DeiT + RoBERTa reached the highest accuracy of 54.87%, while DeiT + ALBERT achieved the highest cumulative BLEU score of 25.50. Qualitative analyses further show that BLEU scores capture partial correctness, synonymous terminology, and clinically relevant overlaps that are ignored by accuracy-based evaluation.

Conclusion: Language-based evaluation metrics, particularly BLEU scores, provide important complementary insights beyond accuracy in open-ended MedVQA tasks. By capturing semantic similarity and partial correctness, such metrics support a more clinically meaningful assessment of model outputs. These findings emphasize the need for multi-metric evaluation frameworks, as exact-match accuracy alone is insufficient for open-ended responses.
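To illustrate why exact-match accuracy and BLEU can disagree on an open-ended answer, the following is a minimal sketch of both metrics on a single answer pair. It is not the study's evaluation code: the whitespace tokenization, the absence of smoothing, and the example answer pair are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(pred, ref, n):
    """Clipped n-gram precision of the prediction against one reference."""
    pred_counts = Counter(ngrams(pred, n))
    ref_counts = Counter(ngrams(ref, n))
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0  # prediction is shorter than n tokens
    clipped = sum(min(c, ref_counts[g]) for g, c in pred_counts.items())
    return clipped / total


def brevity_penalty(pred, ref):
    """Penalize predictions that are shorter than the reference."""
    if len(pred) == 0:
        return 0.0
    if len(pred) >= len(ref):
        return 1.0
    return math.exp(1 - len(ref) / len(pred))


def cumulative_bleu(pred, ref, max_n=4):
    """Geometric mean of the 1..max_n precisions times the brevity penalty.

    No smoothing is applied (an assumption), so any zero precision
    yields a score of 0.
    """
    precisions = [modified_precision(pred, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(pred, ref) * math.exp(log_mean)


def exact_match(pred, ref):
    """1.0 only when the token sequences are identical."""
    return float(pred == ref)


# Hypothetical answer pair, not taken from VQA-Med.
reference = "pulmonary embolism".split()
prediction = "acute pulmonary embolism".split()

print("exact match:", exact_match(prediction, reference))
print("BLEU-1 precision:", modified_precision(prediction, reference, 1))
print("BLEU-2 precision:", modified_precision(prediction, reference, 2))
print("cumulative BLEU (n<=2):", cumulative_bleu(prediction, reference, max_n=2))
```

In this sketch the prediction scores zero under exact match but receives partial credit at the unigram and bigram levels. Without smoothing, the cumulative score up to 4-grams is zero for such short answers, which is one reason reporting BLEU-1 to BLEU-4 separately can be informative.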
