Background/Objectives: Sacral neuromodulation (SNM) is an established therapy for refractory overactive bladder and non-obstructive urinary retention. With the rapid adoption of large language models (LLMs) such as ChatGPT, their accuracy in procedure-specific domains requires evaluation. The aim of this study was to compare the accuracy, completeness, and reproducibility of ChatGPT versions 3.5, 4.0, and 5.0 in answering patient- and guideline-based questions on SNM. Methods: Twenty questions were developed from international guidelines, device information, and common patient inquiries, covering five domains (mechanism, technique, outcomes, complications, postoperative management), two source types (frequently asked questions [FAQs] vs. guideline-based), and three difficulty levels. These thematic domains were derived from core clinical counseling areas routinely addressed in SNM evaluation and follow-up. Each question was submitted to ChatGPT versions 3.5, 4.0, and 5.0. Responses were rated independently by two urologists on a four-point accuracy scale. Combined success (Grades 1–2) and accuracy trends were compared across versions. Chi-square tests were used to assess differences across versions, Cramér’s V to measure effect size, and Cohen’s kappa to evaluate reproducibility. Results: Accuracy improved progressively across versions. Combined success rates rose from 70% in version 3.5 to 85% in 4.0 and 90% in 5.0 (p = 0.031, Cramér’s V = 0.29). The highest accuracy was observed for questions on mechanism and procedural technique, while complication- and guideline-based questions showed lower performance. FAQ and straightforward questions were answered more reliably than guideline-based or complex ones. Reproducibility was excellent across all versions (κ = 0.81–0.91). Conclusions: ChatGPT 4.0 and 5.0 show strong potential as adjunctive tools for patient education in SNM, particularly for FAQs and procedural explanations.
However, because persistent limitations were observed in guideline interpretation and complication management, clinician oversight remains essential, and these models should not be regarded as substitutes for professional clinical judgment.
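The three statistics named in the Methods (chi-square, Cramér’s V, and Cohen’s kappa) can be illustrated with a minimal Python sketch. The function names and all counts and grades below are hypothetical placeholders chosen for illustration; they are not the study's data, and the abstract does not specify how its contingency tables were constructed or which software was used.

```python
from collections import Counter
from math import sqrt


def chi_square(table):
    """Pearson chi-square statistic for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def cramers_v(table):
    """Cramér's V effect size: sqrt(chi2 / (n * (min(r, c) - 1)))."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0]))
    return sqrt(chi_square(table) / (n * (k - 1)))


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equally long sequences of categorical grades."""
    n = len(ratings_a)
    # Proportion of items on which the two sequences agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    # Agreement expected by chance from the marginal grade frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)


# Hypothetical example: rows are model versions, columns are
# (successful, unsuccessful) response counts. Not the study's data.
hypothetical_table = [[12, 8], [16, 4], [19, 1]]
print("chi-square:", round(chi_square(hypothetical_table), 3))
print("Cramér's V:", round(cramers_v(hypothetical_table), 3))

# Hypothetical four-point grades assigned in two rating rounds.
round_1 = [1, 1, 2, 3, 1, 2, 4, 1, 2, 1]
round_2 = [1, 1, 2, 3, 2, 2, 4, 1, 2, 1]
print("Cohen's kappa:", round(cohens_kappa(round_1, round_2), 3))
```

Converting the chi-square statistic to a p-value additionally requires the chi-square distribution with (r − 1)(c − 1) degrees of freedom, which is omitted here to keep the sketch dependency-free.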
K. Eskandar (Thu,) studied this question.