Abstract

Purpose: To assess whether large language models (LLMs) with advanced reasoning and live web search (LWS) provide recommendations concordant with evidence‐based clinical practice guidelines (CPGs) developed by the American Academy of Orthopaedic Surgeons (AAOS) for anterior cruciate ligament (ACL) and rotator cuff (RC) injury management.

Methods: Recommendations from the CPGs were extracted and developed into a total of 46 questions (n = 15 for ACL, n = 31 for RC). Four configurations were evaluated: GPT‐5 Thinking, GPT‐5 Thinking Deep Research, Gemini 2.5 Pro and Gemini 2.5 Pro Deep Research. Concordance with CPGs, the primary endpoint, was independently evaluated by two orthopaedic surgeons. Citation integrity, the secondary endpoint, was evaluated against four criteria: (1) relevance, ensuring the citation was congruent with the response; (2) accuracy, confirming the citation metadata were correct; (3) existence, to rule out hallucinations; and (4) source quality, ensuring the cited source was from a peer‐reviewed journal. A third investigator performed blinding by anonymously randomising the order of LLM‐generated responses for each CPG recommendation.

Results: All LLMs answered ACL questions concordantly (100%, 15/15; 95% confidence interval [CI]: 78.2%–100%). For RC questions, GPT‐5 Thinking and Gemini 2.5 Pro Deep Research each had one discordant answer (96.8%, 30/31; 95% CI: 83.3%–99.9%), whereas the other two configurations were fully concordant (100%, 31/31; 95% CI: 88.7%–100%). GPT‐5 Thinking achieved 96.8% (231/239; 95% CI: 93.6%–98.6%) citation integrity, improving to 100% (176/176; 95% CI: 97.9%–100%) with Deep Research. Gemini 2.5 Pro showed substantially lower baseline performance (64.6%, 173/268; 95% CI: 58.5%–70.3%) but improved to 98.6% (274/278; 95% CI: 96.4%–99.6%) with Deep Research. Inter‐rater agreement was perfect (κ = 1.0) across all domains except citation relevance, for which agreement remained strong (κ = 0.88).

Conclusions: Contemporary LLMs with agentic capabilities can deliver clinically aligned answers concordant with CPGs on ACL and RC injuries, recovering from previous hallucinations. Built‐in LWS functions are particularly helpful in ensuring citation reliability. Although expert oversight remains imperative, Deep Research allows LLMs to be considered as a first‐pass clinical reasoning companion.

Level of Evidence: NA.
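The abstract reports each proportion with a 95% CI but does not name the interval method. As an illustrative sketch only, the Python below computes an exact (Clopper–Pearson) binomial interval, which is an assumption on our part; it gives bounds close to those reported for the small-sample cases such as 15/15 and 30/31. The function names (`binom_cdf`, `clopper_pearson`) are hypothetical helpers, not anything taken from the study.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); returns 0.0 when k < 0."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(f, target, tol=1e-12):
    """Bisection on [0, 1] for f(p) = target, where f is increasing in p."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for k successes in n trials.

    Assumption: this is one common choice of 'exact' interval; the source
    abstract does not state which method was used. Suitable for the modest
    n seen here (a few hundred), since it sums the binomial terms directly.
    """
    if not 0 <= k <= n or n == 0:
        raise ValueError("need n > 0 and 0 <= k <= n")
    # Lower bound: the p at which P(X >= k) equals alpha / 2 (0 if k == 0).
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2
    )
    # Upper bound: the p at which P(X <= k) equals alpha / 2 (1 if k == n).
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2
    )
    return lower, upper


if __name__ == "__main__":
    # Counts taken from the Results paragraph of the abstract.
    for label, k, n in [("ACL, all configurations", 15, 15),
                        ("RC, one discordant answer", 30, 31)]:
        lo, hi = clopper_pearson(k, n)
        print(f"{label}: {k}/{n} = {k / n:.1%}, 95% CI {lo:.1%} to {hi:.1%}")
```

When every answer is concordant (k = n), the lower bound reduces to the closed form (α/2)^(1/n), which is why a perfect 15/15 score still carries a lower limit near 78%: with only 15 questions, a true concordance rate well below 100% cannot be ruled out.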
Hilmi Burak Sengul
Barış Akın
Mahmut Enes Kayaalp
Gazi University
Sağlık Bilimleri Üniversitesi
Fatih Sultan Mehmet Eğitim Ve Araştırma Hastanesi
www.synapsesocial.com/papers/698827f00fc35cd7a8846fcc — DOI: https://doi.org/10.1002/ksa.70315