Introduction Large language models (LLMs) have demonstrated promising performance on standardized medical examinations, yet systematic comparisons of contemporary multimodal and text-only models on radiology-specific assessments remain limited. Updated and newly released LLMs, including Grok 4.1 (xAI, San Francisco, USA), Bing Copilot GPT-5 (Microsoft, Redmond, USA), DeepSeek V3.2 (DeepSeek AI, Beijing, China), and OpenEvidence (Chalmers University of Technology, Gothenburg, Sweden), have not been evaluated on the American College of Radiology Diagnostic Imaging In-Training (ACR DXIT) examination. This study aimed to compare the performance of seven contemporary LLMs on the 2022 DXIT examination, stratified by question format and radiology subject domain.

Methods Seven LLMs were evaluated on the 106 multiple-choice questions of the 2022 DXIT examination, which comprise 42 written-only and 64 image-based questions. Five multimodal models, namely ChatGPT-5.1 (OpenAI, San Francisco, USA), Gemini 3 Pro (Google, Mountain View, USA), Claude Sonnet 4.5 (Anthropic, San Francisco, USA), Grok 4.1, and Bing Copilot GPT-5, were assessed on all questions. Two text-only models (DeepSeek V3.2 and OpenEvidence) were evaluated on the written-only questions only. A standardized orientation prompt was applied uniformly across all models. Statistical comparisons accounted for the paired nature of the data, as the compared models answered identical questions; Cochran's Q test was used for comparisons across three or more models, and McNemar's test for two-model comparisons. Ninety-five percent confidence intervals for accuracy proportions were calculated using the Wilson score method. For subgroups with fewer than 10 questions, p-values were not reported, and only descriptive statistics are presented.
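The statistical procedures named in the Methods can be illustrated with a minimal Python sketch. It is not the authors' analysis code. The function names are hypothetical, the correct-answer counts (69/106 and 37/42) are back-calculated from the reported percentages rather than taken from the study data, the response matrix and discordant counts are invented toy values, and the continuity-corrected form of McNemar's test is an assumption, since the variant used is not stated.

```python
import math
from statistics import NormalDist


def wilson_interval(correct, total, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


def chi2_sf(x, df):
    """Upper-tail chi-square probability for a positive integer df."""
    y = x / 2
    # Start from the closed form for df = 1 (odd) or df = 2 (even) ...
    if df % 2:
        a, q = 0.5, math.erfc(math.sqrt(y))
    else:
        a, q = 1.0, math.exp(-y)
    # ... then apply Q(a + 1, y) = Q(a, y) + y^a e^(-y) / Gamma(a + 1).
    while a < df / 2:
        q += y**a * math.exp(-y) / math.gamma(a + 1)
        a += 1
    return q


def cochran_q(rows):
    """Cochran's Q for paired binary outcomes (rows: questions, columns: models)."""
    k = len(rows[0])
    col_totals = [sum(r[j] for r in rows) for j in range(k)]
    row_totals = [sum(r) for r in rows]
    n = sum(row_totals)
    denom = k * n - sum(t * t for t in row_totals)
    if denom == 0:  # every question answered identically by all models
        return 0.0, k - 1, 1.0
    q = (k - 1) * (k * sum(c * c for c in col_totals) - n * n) / denom
    return q, k - 1, chi2_sf(q, k - 1)


def mcnemar_corrected(b, c):
    """Continuity-corrected McNemar test from the two discordant-pair counts."""
    if b + c == 0:
        return 0.0, 1.0
    stat = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    return stat, chi2_sf(stat, 1)


if __name__ == "__main__":
    # Counts inferred from reported percentages (assumption, not source data).
    print(wilson_interval(69, 106))  # compare with the reported 55.6%-73.5%
    print(wilson_interval(37, 42))   # compare with the reported 75.0%-94.8%
    # Toy response matrix: 6 questions x 3 models (hypothetical values).
    toy = [[1, 1, 1], [1, 0, 1], [0, 0, 1], [1, 1, 0], [0, 0, 0], [1, 0, 1]]
    print(cochran_q(toy))
    print(mcnemar_corrected(3, 9))   # hypothetical discordant counts
```

The sketch relies only on the standard library so that each formula stays visible; an established statistics package would be the more usual choice for an actual analysis.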
Results Overall accuracy among multimodal models ranged from 65.1% (Claude Sonnet 4.5; 95% confidence interval [CI]: 55.6%-73.5%) to 76.4% (Gemini 3 Pro; 95% CI: 67.5%-83.5%), with no statistically significant differences among models (Cochran's Q=5.07, df=4, p=0.281). All multimodal models performed substantially better on written-only questions (88.1%-95.2%) than on image-based questions (46.9%-64.1%), representing an average gap of approximately 35 percentage points. Neither the written-only nor the image-based comparison among models reached significance (p=0.813 and p=0.226, respectively). Domain-level analysis identified consistent strengths in ultrasound (80%-90%; p=0.948) and chest radiology (70%-90%; p=0.870), and persistent weakness in musculoskeletal imaging (40%-60%; p=0.898). Among text-only models, OpenEvidence and DeepSeek V3.2 achieved accuracies of 83.3% (95% CI: 69.4%-91.7%) and 88.1% (95% CI: 75.0%-94.8%), respectively, on the written-only questions, with no significant difference between them (McNemar's p=0.773).

Conclusion Contemporary multimodal LLMs achieve moderately high accuracy on radiology in-training examination questions, exceeding earlier-generation model benchmarks and junior resident performance levels, yet no single model demonstrated statistically significant superiority. A consistent and substantial performance gap between written and image-based questions persists across all architectures, underscoring unresolved limitations in radiologic image interpretation. These findings suggest that current LLMs may support circumscribed roles in radiology education, particularly for conceptual and non-interpretive content, but remain unsuitable for tasks requiring visual diagnostic reasoning.
Huang et al. (Sun,) studied this question.