Purpose: To evaluate the performance of large language models (LLMs) in infectious disease–specific clinical reasoning and decision-making.
Patients and Methods: Four widely used LLMs (DeepSeek-R1, ChatGPT-5, Grok 3, and Gemini 2.5 Flash) were evaluated using a dual assessment framework that combined examination-based questions with real-world clinical cases. LLM performance was compared with that of infectious disease residents. Examination outcomes were assessed using accuracy and score rates, while clinical case responses were rated by expert reviewers using predefined Likert-scale criteria. In addition, doctors’ independent clinical decisions were compared with those supported by LLMs to assess the potential value of human–AI collaboration.
Results: Across examination-based assessments, LLMs performed comparably to infectious disease residents, with no significant differences observed in accuracy or score rates (p = 0.54). LLMs showed a non-significant trend toward better performance on low-order, knowledge-based questions (p = 0.34), whereas doctors tended to perform better on simple case-based questions (p = 0.74) and particularly on those requiring higher-order clinical reasoning (p = 0.10). In real-world clinical case evaluations (n = 10), LLM-generated responses achieved high ratings for accuracy, completeness, individualization, safety, and readability, with comparable performance across models (p > 0.05). Doctors’ decisions supported by LLMs showed a non-significant trend toward improved accuracy compared with independent decisions (p = 0.19). On completeness, doctors supported by LLMs scored significantly higher than both doctors alone (p = 0.0015) and LLMs alone (p = 0.0010). Nevertheless, clinically meaningful errors occurred in certain high-risk scenarios, underscoring the limitations of standalone LLM decision-making.
Conclusion: LLMs show substantial potential in infectious disease education and clinical decision support, particularly for knowledge-based tasks. However, their limitations in complex clinical reasoning underscore the necessity of clinician oversight. A human–AI collaborative approach appears to offer the greatest benefit, enhancing decision quality while maintaining clinical safety. Continued refinement of regulatory and medico-legal frameworks is critical to support the safe, ethical, and responsible deployment of LLMs in clinical practice.
Keywords: large language models, infectious diseases, clinical decision-making, human–AI collaboration
Wu et al. (Mon,) studied this question.