
March 15, 2026

(O-20) Performance comparison of ChatGPT and DeepSeek in answering urology questions on sexual dysfunction and infertility

Key Points

This research aims to evaluate the accuracy of ChatGPT and DeepSeek in answering urology questions related to sexual dysfunction and infertility.
Evaluation of ChatGPT and DeepSeek based on predefined urology questions.
Questions sourced from an MCQ database created by urology residency program directors.
A medical student submitted each question three times to both chatbots using a standardized prompt.
Answers were compared with the correct responses using descriptive statistics and McNemar’s test.
ChatGPT answered 221 of 381 questions correctly (58.0%); DeepSeek answered 228 (59.8%).
The overall difference was not statistically significant (p = 0.47).
ChatGPT outperformed DeepSeek in female sexual dysfunction (78.8% vs. 69.7%).
DeepSeek surpassed ChatGPT in male factor infertility (64.0% vs. 55.3%).
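
McNemar’s test suits this design because both chatbots answered the same questions, so only the questions on which they disagree carry information about a difference. The following is a minimal Python sketch, assuming the standard chi-square form of the test with one degree of freedom. The function name `mcnemar_chi2`, the `continuity` flag, and the discordant counts 44 and 51 are hypothetical: the source reports only the totals (221 vs. 228, a net difference of 7), not the discordant pairs or whether a continuity correction was applied.

```python
import math
from typing import Tuple


def mcnemar_chi2(b: int, c: int, continuity: bool = False) -> Tuple[float, float]:
    """McNemar's test for paired binary outcomes.

    b: questions only the first model answered correctly
    c: questions only the second model answered correctly
    Returns (chi-square statistic, two-sided p-value with 1 degree of freedom).
    """
    if b + c == 0:
        # No discordant pairs: nothing to distinguish the two models.
        return 0.0, 1.0
    diff = abs(b - c)
    if continuity:
        # Optional continuity correction (an assumption; not stated in the source).
        diff = max(diff - 1, 0)
    chi2 = diff ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical discordant counts chosen only so that c - b = 7,
# matching the reported gap between 228 and 221 correct answers.
chi2, p = mcnemar_chi2(44, 51)
print(f"chi2 = {chi2:.2f}, p = {p:.2f}")
```

Questions that both models answered correctly, or both answered incorrectly, do not enter the statistic, which is why two models with similar overall accuracy can still differ noticeably within individual categories.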

Abstract

Background The accuracy of large language models (LLMs) in specialized fields like sexual medicine and infertility remains unclear. Given the complexity of these topics, evaluating AI models is essential before considering their use in medical education or clinical support. We set out to compare the accuracy of ChatGPT and DeepSeek in answering andrology questions.
Methods This study was conducted between February and March 2025. ChatGPT and DeepSeek were evaluated based on their ability to answer urology questions in the domains of sexual dysfunction and infertility. Both models operated on knowledge bases with information available up to August 2023, without real-time internet access. Questions were sourced from a multiple-choice question (MCQ) database developed by urology residency program directors. A medical student submitted each question three times to both chatbots using a standardized prompt. The chatbot-generated answers were compared to the correct answers. Descriptive statistics were used to analyze accuracy, and McNemar’s test was applied to compare model performance.
Results A total of 381 questions were evaluated, distributed across male sexual dysfunction (MSD) (n = 183), female sexual dysfunction (FSD) (n = 33), male factor infertility (MFI) (n = 150), and sexually transmitted infections (STI) (n = 15). ChatGPT correctly answered 221 questions (58.0%), and DeepSeek correctly answered 228 questions (59.8%), with no statistically significant difference between the two models (McNemar’s test, χ² = 0.52, p = 0.47). When stratified by question category, ChatGPT outperformed DeepSeek in female sexual dysfunction (78.8% vs. 69.7%) and marginally in male sexual dysfunction (58.5% vs. 56.3%). DeepSeek performed better in male factor infertility (64.0% vs. 55.3%) and sexually transmitted infections (40.0% vs. 33.3%).
Conclusions ChatGPT and DeepSeek demonstrated comparable overall performance in answering urology questions on sexual dysfunction and infertility. However, performance varied by question type, highlighting the need for domain-specific validation before clinical or educational deployment of AI models.
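
The per-category percentages can be cross-checked against the reported totals. The Python sketch below back-calculates the number of correct answers implied by each percentage and category size, then sums them. The category sizes and percentages come from the abstract; the per-category counts are inferred by rounding and are not reported in the source, and the names `reported` and `implied_correct` are illustrative.

```python
# (category, n, ChatGPT % correct, DeepSeek % correct) as reported in the abstract
reported = [
    ("MSD", 183, 58.5, 56.3),
    ("FSD", 33, 78.8, 69.7),
    ("MFI", 150, 55.3, 64.0),
    ("STI", 15, 33.3, 40.0),
]


def implied_correct(n: int, pct: float) -> int:
    """Number of correct answers implied by a rounded percentage (inferred, not reported)."""
    return round(n * pct / 100)


total_n = sum(n for _, n, _, _ in reported)
chatgpt_total = sum(implied_correct(n, gpt) for _, n, gpt, _ in reported)
deepseek_total = sum(implied_correct(n, ds) for _, n, _, ds in reported)

for name, n, gpt, ds in reported:
    print(f"{name}: ChatGPT {implied_correct(n, gpt)}/{n}, DeepSeek {implied_correct(n, ds)}/{n}")
print(f"Total: ChatGPT {chatgpt_total}/{total_n}, DeepSeek {deepseek_total}/{total_n}")
```

Under this rounding assumption the implied counts add up to the reported totals of 221 and 228 out of 381. The small FSD and STI categories (33 and 15 questions) mean that a difference of only a few questions shifts the percentage by several points, so the category-level gaps should be read with caution.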
