What question did this study set out to answer?

This research aims to evaluate the accuracy of ChatGPT versions in responding to queries about sacral neuromodulation.

February 14, 2026 | Open Access

Assessing ChatGPT Accuracy Across Versions for Patient and Guideline Queries in Sacral Neuromodulation

Key Points

This research aims to evaluate the accuracy of ChatGPT versions in responding to queries about sacral neuromodulation.
Twenty questions were developed from international guidelines, device information, and common patient inquiries.
Two urologists independently rated each response on a four-point accuracy scale.
Combined success rates and accuracy trends were compared across ChatGPT versions using chi-square tests, with Cramer’s V as the effect size.
Combined success rates rose from 70% in version 3.5 to 85% in version 4.0 and 90% in version 5.0.
Reproducibility was high across versions, indicated by Cohen’s kappa values between 0.81 and 0.91.
FAQ and procedural questions yielded higher accuracy compared to guideline or complex questions.

Abstract

Background/Objectives: Sacral neuromodulation (SNM) is an established therapy for refractory overactive bladder and non-obstructive urinary retention. With the rapid adoption of large language models (LLMs) such as ChatGPT, their accuracy in procedure-specific domains requires evaluation. The aim of this study was to compare the accuracy, completeness, and reproducibility of ChatGPT versions 3.5, 4.0, and 5.0 in answering patient- and guideline-based questions on SNM.

Methods: Twenty questions were developed from international guidelines, device information, and common patient inquiries, covering five domains (mechanism, technique, outcomes, complications, postoperative management), two source types (frequently asked questions [FAQs] vs. guideline-based), and three difficulty levels. These thematic domains were derived from core clinical counseling areas routinely addressed in SNM evaluation and follow-up. Each question was submitted to ChatGPT versions 3.5, 4.0, and 5.0. Responses were rated independently by two urologists on a four-point accuracy scale. Combined success (Grades 1–2) and accuracy trends were compared across versions. Chi-square tests were used to assess differences across versions, Cramer’s V to measure effect size, and Cohen’s kappa to evaluate reproducibility.

Results: Accuracy improved progressively across versions. Combined success rates rose from 70% in version 3.5 to 85% in 4.0 and 90% in 5.0 (p = 0.031, Cramer’s V = 0.29). The highest accuracy was observed for mechanism and procedural technique questions, while complication- and guideline-based questions showed lower performance. FAQ and straightforward questions were answered more reliably than guideline-based or complex ones. Reproducibility was excellent across all versions (κ = 0.81–0.91).

Conclusions: ChatGPT 4.0 and 5.0 show strong potential as adjunctive tools for patient education in SNM, particularly for FAQs and procedural explanations. However, because persistent limitations were observed in guideline interpretation and complication management, clinician oversight remains essential, and these models should not be regarded as substitutes for professional clinical judgment.
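
The three statistics named in the Methods can be illustrated with a short sketch. This is not the authors' analysis code, and the abstract does not report the underlying counts, so no study data appear here. The function names and the table layout (one row per ChatGPT version, one column each for successful and unsuccessful ratings) are assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt


def chi_square_and_cramers_v(table):
    """Return (chi-square statistic, degrees of freedom, Cramer's V).

    `table` is a contingency table as a list of rows of counts, for
    example one row per model version and one column per outcome.
    This layout is an assumption, not taken from the study.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    n_rows, n_cols = len(table), len(table[0])
    dof = (n_rows - 1) * (n_cols - 1)
    # Cramer's V rescales chi-square to the range 0..1.
    v = sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))
    return chi2, dof, v


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equal-length sequences of categorical ratings.

    Undefined (division by zero) when chance agreement equals 1,
    i.e. when both sequences use a single identical category.
    """
    n = len(ratings_a)
    # Proportion of items on which the two sets of ratings agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    # Agreement expected by chance from the marginal frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)
```

The sketch uses only the standard library and therefore stops at the test statistic. A p-value would come from the chi-square distribution with the returned degrees of freedom, for instance through `scipy.stats.chi2_contingency`, which also computes the statistic itself.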
