What question did this study set out to answer?

The study aims to assess the performance of large language models in infectious disease-specific clinical reasoning and decision-making.

June 13, 2026 · Open Access

Evaluation of Large Language Models in Infectious Disease Decision-Making: From Examination to Clinical Practice

Key Points

The study aims to assess the performance of large language models in infectious disease-specific clinical reasoning and decision-making.
Evaluated four large language models: DeepSeek-R1, ChatGPT-5, Grok 3, and Gemini 2.5 Flash.
Used examination-based questions and real-world clinical cases evaluated by expert reviewers.
Compared doctors' independent clinical decisions with those supported by LLMs to measure collaboration value.
LLMs performed comparably to infectious disease residents in examination-based assessments (p=0.54).
Doctors supported by LLMs had significantly higher completeness scores than both doctors alone (p=0.0015) and LLMs alone (p=0.0010).
Clinically meaningful errors occurred in high-risk scenarios, highlighting LLM limitations in complex decision-making.

Abstract

Purpose: To evaluate the performance of large language models (LLMs) in infectious disease–specific clinical reasoning and decision-making.
Patients and Methods: Four widely used LLMs (DeepSeek-R1, ChatGPT-5, Grok 3, and Gemini 2.5 Flash) were evaluated using a dual assessment framework that combined examination-based questions with real-world clinical cases. LLM performance was compared with that of infectious disease residents. Examination outcomes were assessed using accuracy and score rates, while clinical case responses were evaluated by expert reviewers using predefined Likert-scale criteria. In addition, doctors’ independent clinical decisions were compared with those supported by LLMs to assess the potential value of human–AI collaboration.
Results: Across examination-based assessments, LLMs performed comparably to infectious disease residents, with no significant differences observed in accuracy or score rates (p = 0.54). LLMs showed a non-significant trend toward better performance on low-order, knowledge-based questions (p = 0.34), whereas doctors tended to perform better on case-based questions, both simple ones (p = 0.74) and, more notably, those requiring higher-order clinical reasoning (p = 0.10). In real-world clinical case evaluations (n = 10), LLM-generated responses achieved high ratings for accuracy, completeness, individualization, safety, and readability, with comparable performance across models (p > 0.05). Doctors’ decision-making supported by LLMs showed a non-significant trend toward improved accuracy compared with independent decisions (p = 0.19). For completeness, doctors supported by LLMs achieved significantly higher scores than both doctors alone (p = 0.0015) and LLMs alone (p = 0.0010). Nevertheless, clinically meaningful errors occurred in certain high-risk scenarios, underscoring the limitations of standalone LLM decision-making.
Conclusion: LLMs show substantial potential in infectious disease education and clinical decision support, particularly for knowledge-based tasks. However, their limitations in complex clinical reasoning underscore the necessity of clinician oversight. A human–AI collaborative approach appears to offer the greatest benefit, enhancing decision quality while maintaining clinical safety. Continued refinement of regulatory and medico-legal frameworks is critical to support the safe, ethical, and responsible deployment of LLMs in clinical practice.
Keywords: large language models, infectious diseases, clinical decision-making, human–AI collaboration
