What question did this study set out to answer?

This study aims to evaluate the reliability and diagnostic accuracy of multimodal large language models in interpreting radiological images.

February 26, 2026 · Open Access

Chatting Ain’t Diagnosing: Diagnostic Variability and Fundamental Errors in Multimodal LLM Interpretation in Radiology

Key points

Evaluated five leading multimodal LLMs on a non-contrast head CT showing intracranial pathology.
Implemented a novel cross-evaluation protocol where each model graded all responses.
Focused on minimum competency thresholds by using straightforward pathology rather than complex cases.
Identified a 20% rate of fundamental diagnostic error (one of the five evaluated models).
One model misdiagnosed ischemic stroke as intracerebral hemorrhage and also reported the wrong lateralization.
Found significant variability across models in acuity characterization, anatomical localization, and differential diagnoses.

Abstract

Recent studies investigating the diagnostic capabilities of large language models (LLMs) have attracted significant media attention, often resulting in headlines claiming that AI systems can match or even outperform physicians. The rapid proliferation of LLMs has fueled a widespread misconception that they represent the cutting edge of artificial intelligence in all contexts. This narrative tends to overshadow the continued importance of task-specific machine learning models, which were developed and validated for particular diagnostic applications well before the rise of LLMs. This single-case study evaluated the reliability of five leading multimodal LLMs (GPT-5, Gemini 3 Pro, Llama 4 Maverick, Grok 4, and Claude Opus 4.5 Extended) for radiological image interpretation by presenting each model with an identical non-contrast head CT demonstrating intracranial pathology, complemented by a novel cross-evaluation protocol in which each model graded all responses. The deliberate use of a straightforward case (rather than diagnostically challenging pathology) aimed to establish minimum competency thresholds; if LLMs cannot reliably interpret obvious pathology, their deployment on ambiguous cases becomes indefensible. The study intentionally excluded human radiologist ground truth to avoid generating comparative accuracy metrics that could be selectively cited for commercial purposes, focusing on class-wide limitations rather than ranking individual products. Results revealed a 20% rate of fundamental diagnostic error (one of five models): that model misidentified ischemic stroke as intracerebral hemorrhage and reported the wrong lateralization. Even among concordant models, clinically meaningful variability persisted in acuity characterization, anatomical localization, and differential diagnoses. Cross-evaluation exposed ground truth disagreement between models, self-evaluation bias, inconsistent grading stringency, and divergent evaluation philosophies. Only one model included appropriate safety disclaimers. These findings demonstrate that current multimodal LLMs exhibit unacceptable diagnostic variability and evaluative inconsistency for autonomous clinical deployment. The appropriate clinical role for LLMs should be distinguished by deployment context: autonomous diagnosis requires validated task-specific models; decision support applications demand rigorous radiologist oversight protocols; and educational summarization, with mandatory disclaimers, represents the most appropriate current use case. Healthcare applications requiring reliable image interpretation should prioritize validated, task-specific machine learning systems over general-purpose language models.
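
The abstract does not describe how the cross-evaluation grades were recorded or analyzed, so the following is only an illustrative sketch of how a grader-by-author score matrix could be summarized. The function names, the placeholder model labels, the numeric grading scale, and every score below are assumptions made up for illustration; none of them are data or methods from the study. The only figure tied to the source is the arithmetic that one fundamental error among five models gives a 20% rate.

```python
from statistics import mean

# Hypothetical grades: example_scores[grader][author] is the grade that
# `grader` gave to the response written by `author`. Values are invented.
example_scores = {
    "model_a": {"model_a": 9, "model_b": 6, "model_c": 7},
    "model_b": {"model_a": 7, "model_b": 8, "model_c": 5},
    "model_c": {"model_a": 8, "model_b": 4, "model_c": 9},
}


def self_evaluation_bias(scores):
    """Self-assigned grade minus the mean grade received from the other graders.

    A positive value means the model rated its own response higher than
    its peers did.
    """
    bias = {}
    for model in scores:
        self_score = scores[model][model]
        peer_scores = [scores[grader][model] for grader in scores if grader != model]
        bias[model] = self_score - mean(peer_scores)
    return bias


def grader_stringency(scores):
    """Mean grade each grader assigned to responses other than its own.

    Lower values indicate a stricter grader.
    """
    return {
        grader: mean(grade for author, grade in row.items() if author != grader)
        for grader, row in scores.items()
    }


def fundamental_error_rate(error_flags):
    """Fraction of models whose response contained a fundamental error."""
    return sum(error_flags) / len(error_flags)


if __name__ == "__main__":
    print(self_evaluation_bias(example_scores))
    print(grader_stringency(example_scores))
    # One fundamental error among five models, as reported in the abstract.
    print(fundamental_error_rate([False, True, False, False, False]))
```

Excluding each grader's self-assigned grade from the stringency figure keeps self-evaluation bias from inflating it, so the two summaries can be read separately.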
