What does this research mean for the field?

AI-generated responses regarding orthobiologic injections are generally accurate but not written at a patient-appropriate readability level, with Gemini outperforming ChatGPT in accuracy. Novelty: confirmatory. Consensus alignment: neutral.

What question did this study set out to answer?

This research aims to evaluate the accuracy and readability of AI-generated responses to questions about orthobiologic injections.

February 19, 2026 | Open Access

Evaluating Artificial Intelligence-Generated Responses to Patient Questions Regarding Orthobiologic Injections

Key Points

Responses to 20 common patient questions were collected from ChatGPT 4o, Gemini 2.5 Flash, and Grok 3 in July 2025.
Accuracy was rated with the ChatGPT Response Rating System (CRRS) and the AI Response Metric (AIRM).
Readability was measured with the Flesch-Kincaid Grade Level (FKGL).
Four independent reviewers rated the responses: 2 fellowship-trained sports medicine orthopaedic surgeons and 2 fellowship-trained nonoperative sports medicine physicians.
Interrater reliability was strong for all accuracy ratings (ICCs > 0.70; P < .05).
The share of responses needing clarification varied by model: 50% for ChatGPT, 25% for Gemini, and 30% for Grok.
One-way matched ANOVA showed a significant effect of AI model on both CRRS and AIRM accuracy scores (each P = .02), with Gemini more accurate than ChatGPT (CRRS, P = .04; AIRM, P = .03).
All responses exceeded the 6th-grade reading level recommended by the American Medical Association and National Institutes of Health; mean FKGL for all 3 models was at a collegiate level or higher.
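
For readers unfamiliar with the readability metric named above, the standard Flesch-Kincaid Grade Level formula is 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below is illustrative only and is not the tool the study used: the function names are hypothetical, and the syllable counter is a rough vowel-group heuristic, so its scores may differ from those of dedicated readability software.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level of a passage of English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Made-up sentences for illustration, not responses from the study.
simple = "The cat sat on the mat."
clinical = "Platelet-rich plasma injections may modulate inflammation."
print(round(fkgl(simple), 2), round(fkgl(clinical), 2))
```

Short sentences of one-syllable words score near or below grade 0, while a single sentence dense with clinical vocabulary lands well above the 6th-grade target, which is the pattern the study reports for all 3 models.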

Abstract

Background: Patient interest in orthobiologic injections continues to grow. Although patients increasingly rely on artificial intelligence (AI) large language models (LLMs) for health information, it remains unclear whether AI-generated responses regarding orthobiologics are both accurate and written at a reading level suitable for patient education.
Purpose: To assess the accuracy and readability of responses to common patient questions regarding orthobiologic injections from 3 popular AI LLMs (ChatGPT, Gemini, and Grok).
Study Design: Cross-sectional study.
Methods: Responses to 20 common patient questions regarding orthobiologic injections were recorded from ChatGPT 4o, Gemini 2.5 Flash, and Grok 3 in July 2025. Four independent reviewers (2 fellowship-trained sports medicine orthopaedic surgeons and 2 fellowship-trained nonoperative sports medicine physicians) assessed AI responses for accuracy using the ChatGPT Response Rating System (CRRS) and the AI Response Metric (AIRM). Readability of responses was assessed using the Flesch-Kincaid Grade Level (FKGL).
Results: Interrater reliability was strong for all accuracy ratings (ICCs >0.70; P < .05). The share of responses needing clarification was 50% for ChatGPT, 25% for Gemini, and 30% for Grok. One-way matched analysis of variance (ANOVA) revealed a significant effect of AI model on both CRRS (P = .02) and AIRM scores (P = .02), with Gemini more accurate than ChatGPT (CRRS, P = .04; AIRM, P = .03). Regarding readability, the mean FKGL of all 3 models was at a collegiate level or higher, and all responses exceeded the 6th-grade reading level recommended by the American Medical Association and National Institutes of Health for patient education. One-way matched ANOVA also revealed a significant effect of AI model on FKGL (P = .02), with Gemini less readable than ChatGPT (P = .03).
Conclusion: In this study, ChatGPT, Gemini, and Grok provided generally accurate information on orthobiologics but failed to produce responses at a patient-appropriate readability level. Gemini outperformed ChatGPT in accuracy, although all 3 models demonstrated significant limitations in clarity. Until these issues are resolved, AI-generated responses should serve only as supplemental resources, with final patient education directed by physicians.
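
The abstract's one-way matched ANOVA treats each question as a matched unit rated under every model. The minimal sketch below shows how the F statistic for such a design can be computed; it is not the authors' analysis code. The function name `matched_anova_f` is hypothetical and the ratings are invented for illustration; they are not data from the study.

```python
from typing import List, Tuple


def matched_anova_f(scores: List[List[float]]) -> Tuple[float, int, int]:
    """One-way repeated-measures (matched) ANOVA F statistic.

    scores[i][j] is the rating for question i under model j.
    Returns (F, df_model, df_error).
    """
    n = len(scores)
    k = len(scores[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need at least 2 questions and 2 models, with no missing ratings")

    grand = sum(sum(row) for row in scores) / (n * k)
    model_means = [sum(row[j] for row in scores) / n for j in range(k)]
    question_means = [sum(row) / k for row in scores]

    # Partition total variation into model, question, and residual parts.
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_model = n * sum((m - grand) ** 2 for m in model_means)
    ss_question = k * sum((q - grand) ** 2 for q in question_means)
    ss_error = ss_total - ss_model - ss_question
    if ss_error <= 0:
        raise ValueError("no residual variation; F is undefined")

    df_model = k - 1
    df_error = (k - 1) * (n - 1)
    f_stat = (ss_model / df_model) / (ss_error / df_error)
    return f_stat, df_model, df_error


# Hypothetical ratings, NOT data from the study: 3 questions x 3 models.
example = [[1, 2, 3], [2, 3, 5], [3, 4, 4]]
f_stat, df_model, df_error = matched_anova_f(example)
print(f"F({df_model}, {df_error}) = {f_stat:.2f}")
```

Removing the question-to-question variation from the error term is what distinguishes the matched design from an ordinary one-way ANOVA. A p-value could then be read from the F distribution, for example with `scipy.stats.f.sf(f_stat, df_model, df_error)` if SciPy is available.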
