What question did this study set out to answer?

This study aims to evaluate the impact of different prompting strategies on the diagnostic accuracy of multimodal large language models in medical laboratory image recognition.

April 25, 2026 | Open Access

Comparative Evaluation of Five Multimodal Large Language Models for Medical Laboratory Image Recognition: Impact of Prompting Strategies on Diagnostic Accuracy

Key Points

Evaluated five multimodal large language models using 177 proficiency testing images across blood smears, urinalysis, and parasitology domains.
Compared three prompting approaches: complex multi-choice prompts, zero-shot open-ended prompts, and two-step descriptive-reasoning prompts.
Data sourced from external quality assurance archives with expert consensus diagnoses.
Zero-shot prompting significantly outperformed complex multi-choice prompts across all models and domains (p < 0.001).
With zero-shot prompts, Gemini 2.0 Flash achieved 78.5% overall accuracy: urinalysis 92.0%, parasitology 75.5%, blood smears 64.1%.
Two-step descriptive-reasoning prompts improved blood smear accuracy by 8–12%, while the re-query mechanism improved urinalysis accuracy by 7.6%.
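
The three prompting approaches listed above can be sketched as small prompt builders. This is only an illustration of how the strategies differ in structure: the prompt wording, the function names, and the option list are all hypothetical, since the study's actual prompts are not reproduced here.

```python
# Hypothetical sketch of the three prompting strategies.
# None of the wording below is taken from the study.

def multi_choice_prompt(options: list[str]) -> str:
    """Complex multi-choice prompt: the model must pick one listed option.

    The study used 20 diagnostic options; the caller supplies them here.
    """
    numbered = "\n".join(f"{i + 1}. {opt}" for i, opt in enumerate(options))
    return (
        "You are reviewing a medical laboratory image.\n"
        "Choose the single best diagnosis from the options below.\n"
        f"{numbered}\n"
        "Answer with the option number only."
    )


def zero_shot_prompt() -> str:
    """Zero-shot open-ended prompt: no options, no examples."""
    return (
        "What is shown in this laboratory image? "
        "Give your single most likely identification."
    )


def two_step_prompts() -> tuple[str, str]:
    """Two-step descriptive-reasoning prompts.

    Step 1 asks for a description only. Step 2 is a template that is
    filled with the model's own description before asking for a diagnosis.
    """
    describe = (
        "Describe the morphological features visible in this image. "
        "Do not name a diagnosis yet."
    )
    reason = (
        "Based on the following description, state the single most likely "
        "identification:\n{description}"
    )
    return describe, reason
```

In the two-step case, the second prompt would be completed with `reason.format(description=...)` using the text returned from the first call, so the model reasons over its own description rather than jumping straight to a label.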

Abstract

Background: Multimodal large language models (MLLMs) show promise in medical imaging, but their performance depends heavily on prompt engineering. This study systematically evaluates how different prompting strategies affect diagnostic accuracy in clinical laboratory image interpretation.

Methods: We evaluated five MLLMs (ChatGPT-4o, Gemini 2.0 Flash, Claude 3.5 Sonnet, Grok-2, and Perplexity Pro (Claude 3.5 Sonnet)) using 177 proficiency testing images across three domains: blood smears (n = 78), urinalysis (n = 50), and parasitology (n = 49). Three prompting approaches were compared: (1) complex multi-choice prompts with 20 diagnostic options, (2) zero-shot open-ended prompts, and (3) two-step descriptive-reasoning prompts. Images were sourced from the Taiwan Society of Laboratory Medicine external quality assurance archives with expert consensus diagnoses.

Results: Zero-shot prompting significantly outperformed complex multi-choice prompts across all models and domains (p < 0.001). [...] (>90% accuracy) demonstrates the considerable progress of multimodal AI. However, complex morphological tasks like blood smear interpretation require either specialized prompting techniques or domain-specific fine-tuning. These findings provide evidence-based guidance for optimizing AI integration in clinical laboratories.
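
As a rough illustration of how per-domain and overall accuracy can be tallied against expert consensus labels, the sketch below scores a list of (domain, predicted, expected) records. The function name, the record layout, and the case-insensitive exact-match rule are assumptions made for this example; the study's actual scoring procedure is not described here and may have relied on expert judgment rather than string matching.

```python
from collections import defaultdict


def accuracy_by_domain(records):
    """Tally accuracy per domain and pooled over all images.

    records: iterable of (domain, predicted, expected) tuples.
    A prediction counts as correct on a case-insensitive exact match
    (an assumption for this sketch, not the study's scoring rule).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for domain, predicted, expected in records:
        total[domain] += 1
        if predicted.strip().lower() == expected.strip().lower():
            correct[domain] += 1
    per_domain = {d: correct[d] / total[d] for d in total}
    # Pooled accuracy: every image counts once, so larger domains
    # contribute more to the overall figure than smaller ones.
    overall = sum(correct.values()) / sum(total.values())
    return per_domain, overall


# Toy records, invented for illustration only.
records = [
    ("urinalysis", "Calcium oxalate crystals", "calcium oxalate crystals"),
    ("urinalysis", "Uric acid crystals", "Cystine crystals"),
    ("blood smears", "Schistocytes", "schistocytes"),
]
per_domain, overall = accuracy_by_domain(records)
```

Because the overall figure is pooled over images, a domain with more images (blood smears, n = 78 of 177 in this study) weighs more heavily than a smaller one.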
