What question did this study set out to answer?

To assess the adequacy of ChatGPT responses to frequently asked questions regarding adolescent idiopathic scoliosis (AIS).

May 10, 2026

Adequacy of ChatGPT’s Responses to Frequently Asked Questions for Patients With Adolescent Idiopathic Scoliosis

Key Points

The study assessed the adequacy of ChatGPT responses to frequently asked questions about adolescent idiopathic scoliosis (AIS).
A cross-sectional study design was used to compare responses from three versions of ChatGPT.
Responses to thirty frequently asked questions were rated by three orthopedic spine surgeons using a Likert scale.
Responses were evaluated for accuracy against two expert websites (AAOS and SRS), and statistical comparisons were performed.
Median Likert scores were 4 for each of ChatGPT-3.5, ChatGPT-4, and ChatGPT-4o, yet overall response scores differed significantly among versions (P=0.004).
ChatGPT-4o achieved significantly higher accuracy than ChatGPT-3.5 (P=0.005, Bonferroni-adjusted); other pairwise comparisons were not significant.
Of the ChatGPT-3.5 responses, 26/30 (86%) were acceptable for patient use, whereas ChatGPT-4 and ChatGPT-4o each gave appropriate responses to 29/30 (96%) of the questions.

Abstract

STUDY DESIGN: Cross-sectional study.

OBJECTIVE: To evaluate whether the answers given by different versions of ChatGPT to frequently asked questions about AIS, compiled from the patient education websites of the American Academy of Orthopaedic Surgeons (AAOS) and the Scoliosis Research Society (SRS), provide appropriate and sufficient information to patients.

SUMMARY OF BACKGROUND DATA: Artificial intelligence chatbots have gained popularity in medicine due to their ability to analyze substantial scientific data using machine learning techniques and generate human-like responses. These responses can guide patients and families who are seeking information online after a diagnosis of AIS.

METHODS: Thirty frequently asked questions, selected by expert spine surgeons, were posed to 3 versions of ChatGPT using a new internet browser window for each question, and the responses were recorded. Three orthopedic spine surgeons graded the accuracy of the responses against 2 selected expert websites using a Likert scale. Finally, the adequacy of each response for patient use was evaluated.

RESULTS: Median Likert scores for ChatGPT-3.5, ChatGPT-4, and ChatGPT-4o were 4 (1-5), 4 (2-5), and 4 (2-5), respectively. No significant differences were observed among versions within individual categories (all P>0.05). However, a significant difference was found in the overall response scores (P=0.004). Post hoc analysis revealed that ChatGPT-4o achieved significantly higher accuracy than ChatGPT-3.5 (P=0.005, Bonferroni-adjusted), whereas other pairwise comparisons were not significant. When the adequacy of the responses was evaluated, 26/30 (86%) of ChatGPT-3.5 responses were acceptable for patient use, whereas ChatGPT-4 and ChatGPT-4o each provided appropriate responses to 29/30 (96%) of the questions.

CONCLUSIONS: Successive ChatGPT versions demonstrated improved response reliability, with ChatGPT-4o showing a statistically significant advantage over ChatGPT-3.5. Given that ChatGPT-4 and ChatGPT-4o provided accurate and patient-appropriate answers in 96% of cases, these tools may assist in online patient education under clinician supervision.

LEVEL OF EVIDENCE: Level III.
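
The arithmetic behind the reported adequacy percentages, and the general form of a Bonferroni adjustment, can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' analysis code: the counts (26/30 and 29/30) come from the abstract, while the function names are hypothetical, and the abstract does not say which omnibus or pairwise test was used or what the unadjusted P values were. The sketch suggests the reported 86% and 96% are truncated rather than rounded values.

```python
from itertools import combinations
from math import floor

# Counts reported in the abstract: acceptable responses out of 30 questions.
acceptable = {"ChatGPT-3.5": 26, "ChatGPT-4": 29, "ChatGPT-4o": 29}
TOTAL_QUESTIONS = 30


def percent_acceptable(count, total=TOTAL_QUESTIONS):
    """Return the share of acceptable responses as a percentage."""
    return 100.0 * count / total


def bonferroni_adjust(raw_p, n_comparisons):
    """Generic Bonferroni adjustment (hypothetical helper):
    multiply the raw P value by the number of comparisons, capped at 1."""
    return min(1.0, raw_p * n_comparisons)


# Three versions yield three pairwise comparisons.
pairs = list(combinations(acceptable, 2))

for version, count in acceptable.items():
    pct = percent_acceptable(count)
    print(f"{version}: {count}/{TOTAL_QUESTIONS} = {pct:.1f}% "
          f"(truncated: {floor(pct)}%)")

print(f"Pairwise comparisons: {len(pairs)}")
```

With three pairwise comparisons, `bonferroni_adjust(raw_p, len(pairs))` triples an unadjusted P value (capped at 1), which is the usual way an adjusted figure such as the reported P=0.005 is obtained under this correction.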
