Background: Multimodal large language models (MLLMs) show promise in medical imaging, but their performance is highly dependent on prompt engineering. This study systematically evaluates how different prompting strategies affect diagnostic accuracy in clinical laboratory image interpretation.
Methods: We evaluated five MLLMs (ChatGPT-4o, Gemini 2.0 Flash, Claude 3.5 Sonnet, Grok-2, and Perplexity Pro (Claude 3.5 Sonnet)) using 177 proficiency testing images across three domains: blood smears (n = 78), urinalysis (n = 50), and parasitology (n = 49). Three prompting approaches were compared: (1) complex multi-choice prompts with 20 diagnostic options, (2) zero-shot open-ended prompts, and (3) two-step descriptive-reasoning prompts. Images were sourced from the Taiwan Society of Laboratory Medicine external quality assurance archives with expert consensus diagnoses.
Results: Zero-shot prompting significantly outperformed complex multi-choice prompts across all models and domains (p 90% accuracy) demonstrates the considerable progress of multimodal AI. However, complex morphological tasks like blood smear interpretation require either specialized prompting techniques or domain-specific fine-tuning. These findings provide evidence-based guidance for optimizing AI integration in clinical laboratories.
Yang et al. studied this question.
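The three prompting approaches named in the Methods can be illustrated with a minimal Python sketch. The abstract does not give the prompt wording or the scoring procedure, so every prompt string below, the `PromptPlan` type, and the exact-match `accuracy` helper are hypothetical stand-ins rather than the authors' protocol. Only the domain counts and the figure of 20 diagnostic options come from the abstract.

```python
from dataclasses import dataclass

# Image counts per domain, as stated in the abstract (total 177).
DOMAIN_COUNTS = {"blood smears": 78, "urinalysis": 50, "parasitology": 49}


@dataclass(frozen=True)
class PromptPlan:
    """Hypothetical container: one string per model call, in order."""
    strategy: str
    turns: tuple


def multi_choice_prompt(options):
    """Approach 1: a single call that lists the candidate diagnoses.

    The study used 20 options; the wording here is an assumption.
    """
    listing = "\n".join(f"{i}. {opt}" for i, opt in enumerate(options, start=1))
    text = "Select the single best diagnosis for this image:\n" + listing
    return PromptPlan("multi-choice", (text,))


def zero_shot_prompt():
    """Approach 2: a single open-ended question (wording assumed)."""
    return PromptPlan("zero-shot", ("What does this laboratory image show?",))


def two_step_prompt():
    """Approach 3: describe first, then reason to a diagnosis (wording assumed)."""
    return PromptPlan(
        "two-step",
        (
            "Describe the morphological features visible in this image.",
            "Based on that description, state the most likely diagnosis.",
        ),
    )


def accuracy(predictions, truths):
    """Fraction of case-insensitive exact matches.

    Exact matching is an assumption; the abstract does not say how
    model answers were compared with the expert consensus diagnoses.
    """
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    if not truths:
        raise ValueError("at least one labelled image is required")
    hits = sum(
        p.strip().lower() == t.strip().lower()
        for p, t in zip(predictions, truths)
    )
    return hits / len(truths)
```

In this sketch the only structural difference between the approaches is the number and content of the turns sent with each image: the multi-choice and zero-shot plans make one call, while the two-step plan makes two.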