Artificial intelligence (AI) continues to evolve as a tool in clinical decision support. Large language models (LLMs), such as ChatGPT and DeepSeek, are increasingly used in medicine to provide fast, accessible information. This study aimed to compare the performance of ChatGPT and DeepSeek in generating recommendations for the management of postprostatectomy urinary incontinence (PPUI), based on the AUA/SUFU guideline. A total of 20 questions (10 conceptual and 10 case-based) were developed from the AUA/SUFU guideline by three urologists with expertise in PPUI. Each question was submitted in English using zero-shot prompting to ChatGPT-4o and DeepSeek R1. Responses were limited to 200 words and graded independently as correct (1 point), partially correct (0.5 points), or incorrect (0 points). Total and domain-specific scores were compared. ChatGPT achieved 19 out of 20 points (95.0%), while DeepSeek scored 14.5 out of 20 (72.5%; p = 0.031). On conceptual questions, scores were 9.0 (ChatGPT) and 8.0 (DeepSeek; p = 0.50). On case-based scenarios, ChatGPT scored 10.0 versus 6.5 for DeepSeek (p = 0.08). ChatGPT outperformed DeepSeek across all guideline domains. DeepSeek made critical errors in the treatment domain, such as recommending a male sling for irradiated patients. ChatGPT demonstrated superior performance in providing guideline-based recommendations for PPUI. However, both models should be used under expert supervision, and future research is needed to optimize their safe integration into clinical workflows.
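The scoring arithmetic described above can be illustrated with a minimal Python sketch. It uses only the rubric values and subtotals reported in the abstract; the names `GRADE_POINTS`, `total_score`, `percent`, and `reported` are hypothetical and chosen for illustration, and the sketch does not attempt to reproduce the statistical tests, which the abstract does not name.

```python
# Point values taken from the grading rubric in the abstract.
GRADE_POINTS = {"correct": 1.0, "partially correct": 0.5, "incorrect": 0.0}

N_QUESTIONS = 20  # 10 conceptual + 10 case-based


def total_score(grades):
    """Sum rubric points over a list of grade labels (hypothetical helper)."""
    return sum(GRADE_POINTS[g] for g in grades)


def percent(score, n_questions=N_QUESTIONS):
    """Express a point total as a percentage of the maximum achievable score."""
    return 100.0 * score / n_questions


# Subtotals as reported in the abstract; per-question grades are not given,
# so only the aggregate figures are used here.
reported = {
    "ChatGPT": {"conceptual": 9.0, "case_based": 10.0},
    "DeepSeek": {"conceptual": 8.0, "case_based": 6.5},
}

summary = {}
for model, parts in reported.items():
    total = parts["conceptual"] + parts["case_based"]
    summary[model] = (total, percent(total))
    print(f"{model}: {total} / {N_QUESTIONS} points ({percent(total):.1f}%)")
```

The two subtotals for each model add up to the overall totals stated in the abstract (19 and 14.5 points), which gives a quick consistency check on the reported percentages.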
Pinto et al. (Tue,) studied this question.