ChatGPT demonstrated higher accuracy than Gemini (87.2% vs. 66.3%) in predicting ICU discharge decisions compared to physician judgments, with substantial agreement for ChatGPT (κ = 0.737).
Observational (n=398)
No
Do large language models (ChatGPT and Gemini) accurately predict intensive care unit discharge decisions compared to ICU physicians in adult patients?
ChatGPT demonstrated high accuracy and substantial agreement with ICU physicians for discharge decisions, suggesting potential as a clinical decision-support tool in critical care.
Effect estimate: Accuracy
Absolute Event Rate: 87.2% vs 66.3%
Background: The aim of this study was to evaluate the effectiveness of two general-purpose Large Language Models (LLMs), ChatGPT and Gemini, in predicting Intensive Care Unit (ICU) discharge decisions (discharge vs. non-discharge). By comparing their outputs with decisions made by ICU physicians, we sought to determine the alignment of AI-generated recommendations with expert clinical judgment and assess their potential as decision-support tools in critical care.
Methods: This prospective observational cohort study was conducted in a tertiary ICU between September 2024 and May 2025. Adult patients (≥18 years) requiring ICU discharge decisions were included. Standardized clinical prompts were generated from electronic health records and input into ChatGPT and Gemini. The models’ binary discharge decisions were compared to those of ICU physicians. Model performance was assessed using accuracy, sensitivity, specificity, F1 score, Cohen’s kappa, and McNemar’s test. Discharge was defined as the positive class for all diagnostic performance analyses.
Results: A total of 398 patients were analyzed. ChatGPT demonstrated higher accuracy than Gemini (87.2% vs. 66.3%), with higher sensitivity (85.9% vs. 46.9%) and F1 score (0.890 vs. 0.628), whereas Gemini showed higher specificity (96.2% vs. 89.2%). Agreement with clinician decisions was substantial for ChatGPT (κ = 0.737, p = 0.024) and fair for Gemini (κ = 0.379, p < 0.001). Laboratory markers such as lactate, hemoglobin, and procalcitonin significantly differed between discharged and non-discharged patients.
Conclusion: Large language models may support ICU discharge decisions when guided by structured, guideline-informed prompting. ChatGPT achieved higher overall accuracy, sensitivity, and F1 score, whereas Gemini demonstrated higher specificity.
Trial registration: Externation (Discharge) of ICU, NCT06584890, registered 03 September 2024, prospectively registered, https://register.clinicaltrials.gov/prs/beta/studies/S000EVXZ00000029/recordSummary
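The performance measures named in the Methods (accuracy, sensitivity, specificity, F1 score, Cohen's kappa, and McNemar's test) can all be derived from a 2x2 table of model decisions against physician decisions, with discharge as the positive class. The following is a minimal sketch of those calculations, not the authors' analysis code. The function name `agreement_metrics` and the example counts are hypothetical and are not taken from the study, and applying an exact McNemar test to the model-versus-physician disagreements is an assumption about how the test was used.

```python
from math import comb


def agreement_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Score a model's binary decisions against physician decisions.

    Discharge is the positive class:
      tp = model and physician both say discharge
      tn = model and physician both say non-discharge
      fp = model says discharge, physician says non-discharge
      fn = model says non-discharge, physician says discharge
    Assumes both classes occur at least once (no zero denominators).
    """
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)

    # Cohen's kappa: observed agreement corrected for the agreement
    # expected by chance from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (accuracy - expected) / (1 - expected)

    # Exact two-sided McNemar test on the discordant pairs (fp, fn),
    # using a binomial distribution with success probability 0.5.
    discordant = fp + fn
    if discordant == 0:
        mcnemar_p = 1.0
    else:
        tail = sum(comb(discordant, k) for k in range(min(fp, fn) + 1))
        mcnemar_p = min(1.0, 2 * tail / 2**discordant)

    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "kappa": kappa,
        "mcnemar_p": mcnemar_p,
    }


# Illustrative counts only; these are NOT the study's data.
example = agreement_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Kappa is reported alongside accuracy because accuracy alone does not account for agreement that would occur by chance when one decision (discharge or non-discharge) dominates.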
Turan et al. conducted a prospective observational cohort study of Intensive Care Unit (ICU) discharge decisions (n=398). ChatGPT and Gemini were compared with ICU physicians on the accuracy of their discharge decisions. ChatGPT demonstrated higher accuracy than Gemini (87.2% vs. 66.3%) when model predictions were compared with physician decisions, with substantial agreement for ChatGPT (κ = 0.737).