What question did this study set out to answer?

April 8, 2026 · Open Access

Comparative Performance of Seven Mainstream Large Language Models on the 2022 American College of Radiology Diagnostic Imaging In-Training Examination

Key Points

To evaluate the performance of seven contemporary large language models on the 2022 American College of Radiology Diagnostic Imaging In-Training Examination.
Evaluated seven LLMs on 106 multiple-choice questions from the 2022 DXIT examination.
Included five multimodal models and two text-only models, assessed on written and image-based questions.
Applied a standardized prompt across models and used Cochran's Q test and McNemar's test for statistical comparisons.
Multimodal models achieved accuracy ranging from 65.1% to 76.4%, with no significant differences among them.
Multimodal models showed a performance gap between written-only questions (88.1%-95.2%) and image-based questions (46.9%-64.1%).
Text-only models OpenEvidence and DeepSeek V3.2 had accuracies of 83.3% and 88.1%, respectively, on written-only questions, with no significant difference between them.

Abstract

Introduction: Large language models (LLMs) have demonstrated promising performance on standardized medical examinations, yet systematic comparisons of contemporary multimodal and text-only models on radiology-specific assessments remain limited. Updated and newly released LLMs, including Grok 4.1 (xAI, San Francisco, USA), Bing Copilot GPT-5 (Microsoft, Redmond, USA), DeepSeek V3.2 (DeepSeek AI, Beijing, China), and OpenEvidence (Chalmers University of Technology, Gothenburg, Sweden), have not been evaluated on the American College of Radiology Diagnostic Imaging In-Training (ACR DXIT) examination. This study aimed to compare the performance of seven contemporary LLMs on the 2022 DXIT examination, stratified by question format and radiology subject domain.

Methods: Seven LLMs were evaluated on all 106 multiple-choice questions from the 2022 DXIT examination, comprising 42 written-only and 64 image-based questions. Five multimodal models, namely ChatGPT-5.1 (OpenAI, San Francisco, USA), Gemini 3 Pro (Google, Mountain View, USA), Claude Sonnet 4.5 (Anthropic, San Francisco, USA), Grok 4.1, and Bing Copilot GPT-5, were assessed on all questions. Two text-only models (DeepSeek V3.2 and OpenEvidence) were evaluated on written-only questions. A standardized orientation prompt was applied uniformly across all models. Statistical comparisons accounted for the paired nature of the data, as all models within a comparison answered identical questions; Cochran's Q test was used for comparisons across three or more models, and McNemar's test for two-model comparisons. Ninety-five percent confidence intervals for accuracy proportions were calculated using the Wilson score method. For subgroups with fewer than 10 questions, p-values were not reported, and only descriptive statistics are presented.
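The paired tests named in the Methods can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code: the function names, the toy answer matrix, and the discordant counts are hypothetical, and the choice of the continuity-corrected form of McNemar's test is an assumption, since the abstract does not state which variant was used.

```python
import math

def cochran_q(answers):
    """Cochran's Q for paired binary outcomes.

    answers: one row per question, one 0/1 entry per model (1 = correct).
    Returns (Q, degrees_of_freedom).
    """
    k = len(answers[0])                                   # number of models
    col_totals = [sum(row[j] for row in answers) for j in range(k)]
    row_totals = [sum(row) for row in answers]
    n = sum(row_totals)                                   # total correct answers
    denom = k * n - sum(r * r for r in row_totals)
    if denom == 0:
        raise ValueError("all models agree on every question; Q is undefined")
    q = (k - 1) * (k * sum(c * c for c in col_totals) - n * n) / denom
    return q, k - 1

def chi2_sf(x, df):
    """Upper-tail chi-square probability, closed forms for df = 1 or even df."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df % 2 == 0:
        half = x / 2
        terms = (half ** i / math.factorial(i) for i in range(df // 2))
        return math.exp(-half) * sum(terms)
    raise NotImplementedError("odd df > 1 needs an incomplete gamma function")

def mcnemar_p(b, c):
    """McNemar's test with continuity correction (an assumed variant).

    b, c: discordant counts (only model A correct, only model B correct).
    """
    if b + c == 0:
        return 1.0
    # Clamp at zero so that b == c gives a statistic of 0 (a design choice here).
    stat = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    return chi2_sf(stat, 1)

# Hypothetical toy data: 4 questions answered by 3 models.
toy = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [1, 0, 0]]
q, df = cochran_q(toy)
print(f"Cochran's Q = {q:.2f}, df = {df}, p = {chi2_sf(q, df):.3f}")

# Hypothetical discordant counts for a two-model comparison.
print(f"McNemar p = {mcnemar_p(3, 8):.3f}")
```

As a consistency check on the closed form, `chi2_sf(5.07, 4)` evaluates to about 0.280, in line with the p=0.281 reported in the Results once rounding of Q to two decimals is taken into account.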
Results: Overall accuracy among multimodal models ranged from 65.1% (Claude Sonnet 4.5; 95% confidence interval (CI): 55.6%-73.5%) to 76.4% (Gemini 3 Pro; 95% CI: 67.5%-83.5%), with no statistically significant differences among models (Cochran's Q=5.07, df=4, p=0.281). All multimodal models performed substantially better on written-only questions (88.1%-95.2%) than on image-based questions (46.9%-64.1%), representing an average gap of approximately 35 percentage points. Neither the written-only nor the image-based between-model comparison reached significance (p=0.813 and p=0.226, respectively). Domain-level analysis identified consistent strengths in ultrasound (80%-90%; p=0.948) and chest radiology (70%-90%; p=0.870), and persistent weakness in musculoskeletal imaging (40%-60%; p=0.898). Among text-only models, OpenEvidence and DeepSeek V3.2 achieved accuracies of 83.3% (95% CI: 69.4%-91.7%) and 88.1% (95% CI: 75.0%-94.8%), respectively, on written-only questions, with no significant difference between them (McNemar's p=0.773).

Conclusion: Contemporary multimodal LLMs achieve moderately high accuracy on radiology in-training examination questions, exceeding earlier-generation model benchmarks and junior resident performance levels, yet no single model demonstrated statistically significant superiority. A consistent and substantial performance gap between written and image-based questions persists across all architectures, underscoring unresolved limitations in radiologic image interpretation. These findings suggest that current LLMs may support circumscribed roles in radiology education, particularly for conceptual and non-interpretive content, but remain unsuitable for tasks requiring visual diagnostic reasoning.
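The Wilson score intervals quoted in the Results can be checked with a short Python sketch. The function name is hypothetical, and the correct-answer counts (69/106, 81/106, 35/42, 37/42) are assumptions back-calculated from the reported percentages and question totals; the abstract itself reports only percentages.

```python
from statistics import NormalDist

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96 for 95%
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * ((p * (1 - p) / n + z * z / (4 * n * n)) ** 0.5) / denom
    return center - half_width, center + half_width

# Counts below are inferred from the reported percentages (an assumption).
cases = [
    ("Claude Sonnet 4.5, all questions", 69, 106),
    ("Gemini 3 Pro, all questions", 81, 106),
    ("OpenEvidence, written-only", 35, 42),
    ("DeepSeek V3.2, written-only", 37, 42),
]
for label, correct, total in cases:
    low, high = wilson_interval(correct, total)
    print(f"{label}: {correct / total:.1%} (95% CI {low:.1%} to {high:.1%})")
```

Under these assumed counts the function returns intervals that agree with the reported ones to one decimal place, for example 55.6% to 73.5% for 69 of 106 correct. Unlike the simpler normal-approximation interval, the Wilson interval stays within 0% to 100% and behaves better for small samples such as the 42 written-only questions.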

Cite This Study

Huang et al. studied this question.

synapsesocial.com/papers/69d5f0d774eaea4b11a7a4b3 https://doi.org/10.7759/cureus.106486
