April 18, 2026 | Open Access

Performance of GPT-5 and Gemini 2.5 Pro on the Orthopaedic In-Training Examination

Key Points

Assessed the accuracy and reasoning quality of GPT-5 and Gemini 2.5 Pro on the Orthopaedic In-Training Examination.
Conducted a controlled evaluation of GPT-5 and Gemini 2.5 Pro against 412 OITE-style questions.
Evaluated overall and subspecialty-specific accuracy as primary outcomes.
Analyzed explanatory quality and error patterns for both models.
Used McNemar’s exact test for paired accuracy comparison.
Assessed performance on questions with and without imaging.
Gemini 2.5 Pro outperformed GPT-5 on overall accuracy (81.1% vs 76.0%).
Accuracy declined significantly on image-containing questions (74.2% vs 71.6%).
Subspecialty accuracy varied widely, with both models performing worst on Hand and Wrist questions.
Faulty reasoning was the most common error type for GPT-5 (52.5%), whereas stem misinterpretation predominated for Gemini (43.6%).
Response consistency was high (fully consistent responses: 88% for GPT-5, 84% for Gemini), and all inconsistent outputs occurred in image-containing questions.

Abstract

Background: Previous studies evaluating large language models (LLMs) on the Orthopaedic In-Training Examination (OITE) have primarily focused on earlier-generation models and single-pass accuracy. These investigations did not assess newer multimodal systems such as GPT-5 and Gemini 2.5 Pro, nor did they examine the reasoning quality underlying model responses or the consistency of outputs across repeated trials. As LLMs are increasingly used as educational tools, a more comprehensive evaluation framework is needed to assess not only correctness but also reliability and explanatory validity on specialty-specific, image-rich examinations.

Methods: We conducted a controlled, parallel evaluation of GPT-5 and Gemini 2.5 Pro using 412 OITE-style questions from the 2023–2024 examination cycle obtained via an institutional AAOS ResStudy subscription. Primary outcomes included overall and subspecialty-specific accuracy. Secondary analyses evaluated explanatory quality, error-pattern classification, response consistency across repeated trials, and performance stratified by imaging burden. Paired accuracy was compared using McNemar’s exact test.

Results: Gemini 2.5 Pro demonstrated higher overall accuracy than GPT-5 on the 2023–2024 OITE question set (81.1% vs 76.0%), with both models exceeding published PGY-5 resident benchmarks. Accuracy declined significantly on questions containing images (74.2% vs 71.6%). Subspecialty performance varied widely, with accuracy ranging from 42.9% to 94.1% for GPT-5 and from 57.1% to 95.8% for Gemini, and both models performing worst on Hand and Wrist questions. Among incorrect responses, faulty reasoning accounted for 52.5% of GPT-5 errors, whereas stem misinterpretation was the predominant error for Gemini (43.6%). Incorrect or partially correct explanations accompanied 45.4% of GPT-5 and 41.7% of Gemini responses. Consistency testing showed high reproducibility (fully consistent responses: 88% for GPT-5 and 84% for Gemini), with all inconsistent outputs occurring in image-containing questions.

Conclusions: GPT-5 and Gemini 2.5 Pro demonstrate strong performance on recent OITE content, exceeding prior LLM benchmarks. However, persistent limitations in multimodal reasoning, explanatory reliability, and response consistency indicate that high accuracy alone does not ensure dependable clinical reasoning, underscoring the need for cautious educational use.
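
For readers unfamiliar with the paired comparison named in the Methods, the following is a minimal sketch of a two-sided exact McNemar test. It is not the authors' analysis code. The function name `mcnemar_exact` and the discordant-pair counts in the usage lines are hypothetical and chosen for illustration only; the abstract does not report the discordant counts. The test uses only the questions on which the two models disagree and asks whether those disagreements split evenly, as they would under a Binomial(n, 0.5) null.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: questions model A answered correctly and model B answered incorrectly
    c: questions model A answered incorrectly and model B answered correctly
    Questions both models got right (or both got wrong) do not enter the test.
    """
    n = b + c
    if n == 0:
        # No disagreements at all: no evidence of a difference.
        return 1.0
    k = min(b, c)
    # Under the null hypothesis, the smaller discordant count follows
    # Binomial(n, 0.5); sum the lower tail and double it for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical counts for illustration only (not taken from the study).
    b_example, c_example = 40, 19
    print(f"p = {mcnemar_exact(b_example, c_example):.4f}")
```

Because only discordant pairs matter, two models with similar overall accuracy can still differ significantly if their errors fall on different questions, and vice versa.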
