What question did this study set out to answer?

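The study asks whether large language models with advanced reasoning and live web search give recommendations concordant with American Academy of Orthopaedic Surgeons guidelines for anterior cruciate ligament and rotator cuff injuries, and how reliable the citations they supply are.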

February 8, 2026

Deep research capabilities in GPT-5 thinking and Gemini 2.5 Pro improve citation integrity and concordance with American Academy of Orthopaedic Surgeons anterior cruciate ligament and rotator cuff guidelines.

Key Points

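Four configurations tested: GPT-5 Thinking and Gemini 2.5 Pro, each with and without Deep Research
46 questions derived from American Academy of Orthopaedic Surgeons guidelines (15 anterior cruciate ligament, 31 rotator cuff)
Concordance rated independently by two orthopaedic surgeons; citations checked for relevance, accuracy, existence, and source quality
All configurations fully concordant on anterior cruciate ligament questions; 96.8% to 100% concordant on rotator cuff questions
Citation integrity rose with Deep Research: from 96.8% to 100% for GPT-5 Thinking and from 64.6% to 98.6% for Gemini 2.5 Pro
Deep Research proposed as a first-pass clinical reasoning companion, with expert oversight still required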

Abstract

Purpose: To assess whether large language models (LLMs) with advanced reasoning and live web search (LWS) provide recommendations concordant with evidence-based clinical practice guidelines (CPGs) developed by the American Academy of Orthopaedic Surgeons (AAOS) for anterior cruciate ligament (ACL) and rotator cuff (RC) injury management.

Methods: Recommendations from CPGs were extracted and developed into a total of 46 questions (n = 15 for ACL, n = 31 for RC). Four configurations were evaluated: GPT-5 Thinking, GPT-5 Thinking Deep Research, Gemini 2.5 Pro, and Gemini 2.5 Pro Deep Research. Concordance with CPGs, the primary endpoint, was independently evaluated by two orthopaedic surgeons. Citation integrity, the secondary endpoint, was evaluated against four criteria: (1) relevance, ensuring the citation was congruent with the response; (2) accuracy, confirming the citation metadata were correct; (3) existence, to rule out hallucinations; and (4) source quality, ensuring the cited source is from a peer-reviewed journal. Blinding was performed by a third investigator, by anonymously randomising the order of LLM-generated responses for each CPG recommendation.

Results: All LLMs answered ACL questions concordantly (100%, 15/15; 95% confidence interval [CI]: 78.2%–100%). For RC questions, GPT-5 Thinking and Gemini 2.5 Pro Deep Research each had one discordant answer (96.8%, 30/31; 95% CI: 83.3%–99.9%), whereas the other two configurations were fully concordant (100%, 31/31; 95% CI: 88.7%–100%). GPT-5 Thinking achieved 96.8% (231/239; 95% CI: 93.6%–98.6%) citation integrity, improving to 100% (176/176; 95% CI: 97.9%–100%) with Deep Research. Gemini 2.5 Pro showed substantially lower baseline performance (64.6%, 173/268; 95% CI: 58.5%–70.3%) but improved to 98.6% (274/278; 95% CI: 96.4%–99.6%) with Deep Research. Inter-rater agreement was perfect (κ = 1.0) across all domains, except for citation relevance, which maintained strong agreement (κ = 0.88).

Conclusions: Contemporary LLMs with agentic capabilities can deliver clinically aligned answers concordant with CPGs on ACL and RC injuries, recovering from previous hallucinations. Built-in LWS functions are particularly helpful in ensuring citation reliability. Although expert oversight remains imperative, Deep Research allows LLMs to be considered as a first-pass clinical reasoning companion.

Level of Evidence: NA.
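The abstract reports 95% confidence intervals for proportions without naming the method. The sketch below assumes an exact (Clopper-Pearson) binomial interval, which is a guess based on the reported bounds and is not stated in the source. Under that assumption, the lower bound for 15/15 has the closed form 0.025^(1/15), about 78.2%, which agrees with the reported interval. The function names are hypothetical and the code uses only the Python standard library.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def bisect_decreasing(f, lo=0.0, hi=1.0, iters=100):
    """Root of a decreasing function f on [lo, hi], with f(lo) > 0 > f(hi)."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for k successes in n trials.

    Assumption: the source does not state which interval method it used.
    """
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    if k == 0:
        lower = 0.0
    else:
        lower = bisect_decreasing(
            lambda p: alpha / 2 - (1 - binom_cdf(k - 1, n, p))
        )
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    if k == n:
        upper = 1.0
    else:
        upper = bisect_decreasing(lambda p: binom_cdf(k, n, p) - alpha / 2)
    return lower, upper


# Counts taken from the Results paragraph.
for label, k, n in [("ACL, all configurations", 15, 15),
                    ("RC, one discordant answer", 30, 31)]:
    lo, hi = clopper_pearson(k, n)
    print(f"{label}: {k}/{n} -> {100 * lo:.1f}% to {100 * hi:.1f}%")
```

An exact interval is a common choice when a proportion sits at or near 100%, because a normal-approximation interval collapses to zero width there.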

Authors

Hilmi Burak Sengul

Barış Akın

Mahmut Enes Kayaalp

Institutions

Gazi University

Sağlık Bilimleri Üniversitesi

Fatih Sultan Mehmet Eğitim Ve Araştırma Hastanesi
