What does this research mean for the field?

ChatGPT 5 demonstrates the highest diagnostic accuracy for analyzing orthopedic trauma-related imaging among the evaluated AI platforms, but overall accuracy remains low across all models. Novelty: confirmatory. Consensus alignment: neutral.

What question did this study set out to answer?

The research aims to evaluate the diagnostic capabilities of three AI models in identifying common orthopedic trauma fractures using imaging.

March 7, 2026

Comparing The Efficacy Between ChatGPT 5, Grok 3, and Claude 4.5 Sonnet in Analyzing Orthopedic Trauma-Related Imaging

Key Points

Conducted a retrospective comparison study
Utilized public online radiologic imaging databases
Assessed five common orthopedic trauma fractures
Evaluated ChatGPT 5, Grok 3, and Claude 4.5 Sonnet's diagnostic accuracy
Analyzed performance using radiographs and CT images
ChatGPT 5 diagnosed correctly in 26.8% of cases, Grok 3 in 18.8%, and Claude 4.5 Sonnet in 22.4%
ChatGPT 5 had the highest correct classification rates for ankle, femoral neck, humerus, and tibial plateau fractures; Grok 3 had the highest for intertrochanteric fractures
Sensitivity for ChatGPT 5 was 0.267, Grok 3 was 0.187, and Claude 4.5 Sonnet was 0.223
ChatGPT 5 and Grok 3 significantly outperformed Claude 4.5 Sonnet in diagnostic accuracy

Abstract

OBJECTIVES: To evaluate and compare the ability of three popular open-source artificial intelligence (AI) platforms to diagnose common trauma-related fractures using radiologic imaging.

METHODS: Design: Retrospective diagnostic performance comparison study. Setting: Publicly accessible online radiologic imaging databases. Patient Selection Criteria: Five common orthopedic trauma fractures were assessed: ankle, tibial plateau, intertrochanteric, femoral neck, and humerus. Radiographs and computed tomography (CT) images were collected. Images were randomly selected from confirmed diagnoses on Radiopaedia.org. Outcome Measures and Comparisons: ChatGPT 5, Grok 3, and Claude 4.5 Sonnet were queried with each image. Diagnostic accuracy, sensitivity, specificity, positive and negative predictive values, and performance by modality (X-ray vs. CT) were assessed. The reference standard was the expert-verified diagnosis provided by Radiopaedia.org, limited to cases labeled with a “diagnosis certain” tag.

RESULTS: Each model was provided with 30 radiographs and 20 CT images whenever possible. ChatGPT 5, Grok 3, and Claude 4.5 Sonnet accurately diagnosed diseased images in 26.8%, 18.8%, and 22.4% of cases, respectively. By fracture type, ChatGPT 5 demonstrated the highest correct classification rates for ankle (10%), femoral neck (38%), humerus (40%), and tibial plateau (44%) fractures. Grok 3 demonstrated the highest correct classification rate for intertrochanteric fractures (6%). Overall sensitivities were 0.267, 0.187, and 0.223 for ChatGPT 5, Grok 3, and Claude 4.5 Sonnet, respectively. ChatGPT 5 and Grok 3 outperformed Claude 4.5 Sonnet (both p<0.001). No modality-based performance differences were observed for any model.

CONCLUSIONS: Among the publicly available large language models (LLMs) evaluated for radiologic interpretation of orthopedic trauma imaging, ChatGPT 5 demonstrated the highest overall diagnostic accuracy, followed by Claude 4.5 Sonnet and Grok 3. Despite relative variation between the models, overall diagnostic accuracy for fracture detection was low across all platforms (<27%). In their baseline forms, these publicly accessible LLMs are not recommended for radiologic imaging interpretation.

LEVEL OF EVIDENCE: Level III
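
The abstract reports accuracy, sensitivity, specificity, and positive and negative predictive values without restating how they are defined. As a minimal sketch, the Python below computes these measures from a 2x2 confusion table using the standard textbook formulas. The names `ConfusionCounts` and `diagnostic_metrics` and the example counts are hypothetical and chosen only for illustration; they are not the study's data, and the study may have tabulated its results differently.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Cells of a 2x2 diagnostic table (hypothetical container, not from the study)."""
    tp: int  # fracture present, model reports it
    fp: int  # fracture absent, model reports one
    tn: int  # fracture absent, model reports none
    fn: int  # fracture present, model misses it


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def diagnostic_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Standard diagnostic performance measures from confusion counts."""
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "accuracy": _ratio(c.tp + c.tn, total),
        "sensitivity": _ratio(c.tp, c.tp + c.fn),  # true positive rate
        "specificity": _ratio(c.tn, c.tn + c.fp),  # true negative rate
        "ppv": _ratio(c.tp, c.tp + c.fp),          # positive predictive value
        "npv": _ratio(c.tn, c.tn + c.fn),          # negative predictive value
    }


# Illustrative counts only (assumed, not taken from the paper).
example = ConfusionCounts(tp=20, fp=10, tn=40, fn=30)
for name, value in diagnostic_metrics(example).items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, sensitivity is 20 of 50 fracture cases, or 0.400. A sensitivity such as the 0.267 reported for ChatGPT 5 would correspond, under this definition, to roughly one in four true fractures being identified.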
