Abstract
Background The accuracy of large language models (LLMs) in specialized fields like sexual medicine and infertility remains unclear. Given the complexity of these topics, evaluating AI models is essential before considering their use in medical education or clinical support. We set out to compare the accuracy of ChatGPT and DeepSeek in answering andrology questions.
Methods This study was conducted between February and March 2025. ChatGPT and DeepSeek were evaluated based on their ability to answer urology questions in the domains of sexual dysfunction and infertility. Both models operated on knowledge bases with information available up to August 2023, without real-time internet access. Questions were sourced from a multiple-choice question (MCQ) database developed by urology residency program directors. A medical student submitted each question three times to both chatbots using a standardized prompt. The chatbot-generated answers were compared to the correct answers. Descriptive statistics were used to analyze accuracy, and McNemar’s test was applied to compare model performance.
Results A total of 381 questions were evaluated, distributed across male sexual dysfunction (MSD) (n = 183), female sexual dysfunction (FSD) (n = 33), male factor infertility (MFI) (n = 150), and sexually transmitted infections (STI) (n = 15). ChatGPT correctly answered 221 questions (58.0%), and DeepSeek correctly answered 228 questions (59.8%), with no statistically significant difference between the two models (McNemar’s test, χ² = 0.52, p = 0.47). When stratified by question category, ChatGPT outperformed DeepSeek in female sexual dysfunction (78.8% vs. 69.7%) and marginally in male sexual dysfunction (58.5% vs. 56.3%). DeepSeek performed better in male factor infertility (64.0% vs. 55.3%) and sexually transmitted infections (40.0% vs. 33.3%).
Conclusions ChatGPT and DeepSeek demonstrated comparable overall performance in answering urology questions on sexual dysfunction and infertility. However, performance varied by question type, highlighting the need for domain-specific validation before clinical or educational deployment of AI models.
Seyam et al. (Sun,) studied this question.
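The paired comparison described in the Methods can be illustrated with a minimal Python sketch. It recomputes the overall accuracies from the reported counts and shows how a McNemar chi-square statistic maps to a p-value at one degree of freedom. The function names `chi2_sf_1df` and `mcnemar` are illustrative, not taken from the study. The abstract does not report the discordant-pair counts, whether a continuity correction was used, or how the three submissions per question were combined, so the counts passed to `mcnemar` below are hypothetical placeholders. They were chosen only so that their difference matches the reported gap of 7 correct answers (228 vs. 221).

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def mcnemar(b, c, correction=False):
    """McNemar's chi-square test for paired binary outcomes.

    b: questions only the first model answered correctly
    c: questions only the second model answered correctly
    correction: apply the continuity correction if True
    Returns (statistic, p_value).
    """
    n = b + c
    if n == 0:
        raise ValueError("no discordant pairs; the test is undefined")
    diff = abs(b - c) - (1 if correction else 0)
    diff = max(diff, 0)
    stat = diff ** 2 / n
    return stat, chi2_sf_1df(stat)


# Counts reported in the abstract.
TOTAL = 381
correct = {"ChatGPT": 221, "DeepSeek": 228}
accuracy = {model: round(100 * k / TOTAL, 1) for model, k in correct.items()}

# p-value implied by the reported statistic (chi-square = 0.52, 1 df).
reported_p = chi2_sf_1df(0.52)

# HYPOTHETICAL discordant counts, not from the study: only c - b = 7 is
# implied by the reported totals; the split itself is an assumption.
stat, p = mcnemar(b=40, c=47)

print(accuracy)
print(round(reported_p, 2))
print(stat, p)
```

Under these assumptions, the recomputed accuracies and the p-value derived from the reported statistic agree with the figures in the Results. The hypothetical discordant counts only demonstrate the mechanics of the test and should not be read as the study's data.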