What is the clinical evidence from this study?

Study design: Prospective observational cohort. Population: Adult patients requiring Intensive Care Unit (ICU) discharge decisions (n=398). Intervention: ChatGPT and Gemini vs. ICU physicians. Primary outcome: Accuracy of ICU discharge decisions.

What question did this study set out to answer?

This study aims to evaluate how effectively two large language models predict ICU discharge decisions compared to ICU physicians.

June 1, 2026 · Open Access

Evaluation of two large language models for intensive care unit discharge decisions: a prospective observational cohort study

Key Result

Measured against ICU physicians' discharge decisions, ChatGPT was more accurate than Gemini (87.2% vs. 66.3%) and showed substantial agreement with the physicians (κ = 0.737).

Key Points

This study aims to evaluate how effectively two large language models predict ICU discharge decisions compared to ICU physicians.
The study was conducted in a tertiary ICU from September 2024 to May 2025 and included 398 adult patients.
Patients' clinical data were entered into ChatGPT and Gemini as standardized prompts, and each model returned a discharge or non-discharge decision.
Performance was assessed with metrics including accuracy, sensitivity, specificity, and Cohen’s kappa.
ChatGPT showed an accuracy of 87.2% and a sensitivity of 85.9%, both higher than Gemini's 66.3% accuracy and 46.9% sensitivity.
Gemini showed higher specificity (96.2%) than ChatGPT (89.2%).
Agreement with clinician decisions was substantial for ChatGPT (κ = 0.737, p = 0.024) but fair for Gemini (κ = 0.379, p < 0.001).
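
The metrics in the points above all follow from a 2×2 table of model decision versus physician decision, with discharge as the positive class. The sketch below is illustrative only. The cell counts are not reported in the abstract; they were back-calculated from the reported percentages under the assumption that all 398 patients were scored for both models, so they should be read as hypothetical. The names `Confusion` and `metrics` are likewise invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """2x2 table with the physician decision as reference; discharge = positive."""
    tp: int  # physician: discharge,     model: discharge
    fn: int  # physician: discharge,     model: non-discharge
    fp: int  # physician: non-discharge, model: discharge
    tn: int  # physician: non-discharge, model: non-discharge


def metrics(c: Confusion) -> dict:
    n = c.tp + c.fn + c.fp + c.tn
    accuracy = (c.tp + c.tn) / n
    sensitivity = c.tp / (c.tp + c.fn)
    specificity = c.tn / (c.tn + c.fp)
    f1 = 2 * c.tp / (2 * c.tp + c.fp + c.fn)
    # Cohen's kappa: observed agreement corrected for the agreement
    # expected by chance from the row and column totals.
    p_observed = accuracy
    p_expected = (
        (c.tp + c.fn) * (c.tp + c.fp) + (c.tn + c.fp) * (c.tn + c.fn)
    ) / n ** 2
    kappa = (p_observed - p_expected) / (1 - p_expected)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "kappa": kappa,
    }


# ASSUMED counts, back-calculated from the reported percentages (n = 398).
# They are not taken from the study itself.
CHATGPT = Confusion(tp=207, fn=34, fp=17, tn=140)
GEMINI = Confusion(tp=113, fn=128, fp=6, tn=151)

for name, table in (("ChatGPT", CHATGPT), ("Gemini", GEMINI)):
    rounded = {key: round(value, 3) for key, value in metrics(table).items()}
    print(name, rounded)
```

With these assumed counts the function reproduces the reported figures to three decimals, for example accuracy 0.872 and κ 0.737 for ChatGPT, and accuracy 0.663 and κ 0.379 for Gemini. The table also makes the Gemini pattern concrete: few false discharges (high specificity) but many missed discharges (low sensitivity).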

Study Design

Type

Prospective observational cohort (n=398)

Single center (one tertiary ICU)

Structured PICO

Do large language models (ChatGPT and Gemini) accurately predict intensive care unit discharge decisions compared to ICU physicians in adult patients?

Population

398 adult patients (≥18 years) requiring intensive care unit (ICU) discharge decisions in a tertiary ICU

Intervention

Large Language Models (ChatGPT and Gemini) processing standardized clinical prompts from electronic health records

Comparator

Decisions made by ICU physicians

Outcome

Binary discharge decisions (discharge vs. non-discharge) assessed by accuracy, sensitivity, specificity, F1 score, Cohen’s kappa, and McNemar’s test
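
McNemar's test, listed among the outcome measures, uses only the discordant pairs: patients for whom the physician chose discharge and the model did not, and the reverse. The source does not say which variant of the test was used; the sketch below shows the exact binomial form as one common choice. The discordant counts are hypothetical, back-calculated from the reported sensitivity and specificity on the assumption that all 398 patients were scored, and `mcnemar_exact` is a name made up for this example.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts.

    b: physician discharge, model non-discharge
    c: physician non-discharge, model discharge
    Under the null hypothesis each discordant pair falls in either
    cell with probability 0.5, so min(b, c) is Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, nothing to test
    k = min(b, c)
    lower_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * lower_tail)


# ASSUMED discordant counts, back-calculated from the reported percentages.
chatgpt_p = mcnemar_exact(b=34, c=17)
gemini_p = mcnemar_exact(b=128, c=6)
print(f"ChatGPT: p = {chatgpt_p:.3f}")
print(f"Gemini:  p = {gemini_p:.3g}")
```

Under these assumptions the ChatGPT p-value falls between 0.02 and 0.03 and the Gemini p-value is far below 0.001, broadly in line with the reported values. A small p-value here means the model's disagreements with physicians are lopsided: both models in this sketch miss discharges more often than they recommend discharges the physicians did not make.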

ChatGPT demonstrated high accuracy and substantial agreement with ICU physicians for discharge decisions, suggesting potential as a clinical decision-support tool in critical care.

Main Result

Effect estimate: Accuracy

Absolute Event Rate: 87.2% (ChatGPT) vs. 66.3% (Gemini)

Abstract

Background

The aim of this study was to evaluate the effectiveness of two general-purpose Large Language Models (LLMs), ChatGPT and Gemini, in predicting Intensive Care Unit (ICU) discharge decisions (discharge vs. non-discharge). By comparing their outputs with decisions made by ICU physicians, we sought to determine the alignment of AI-generated recommendations with expert clinical judgment and assess their potential as decision-support tools in critical care.

Methods

This prospective observational cohort study was conducted in a tertiary ICU between September 2024 and May 2025. Adult patients (≥18 years) requiring ICU discharge decisions were included. Standardized clinical prompts were generated from electronic health records and input into ChatGPT and Gemini. The models' binary discharge decisions were compared to those of ICU physicians. Model performance was assessed using accuracy, sensitivity, specificity, F1 score, Cohen's kappa, and McNemar's test. Discharge was defined as the positive class for all diagnostic performance analyses.

Results

A total of 398 patients were analyzed. ChatGPT demonstrated higher accuracy than Gemini (87.2% vs. 66.3%), with higher sensitivity (85.9% vs. 46.9%) and F1 score (0.890 vs. 0.628), whereas Gemini showed higher specificity (96.2% vs. 89.2%). Agreement with clinician decisions was substantial for ChatGPT (κ = 0.737, p = 0.024) and fair for Gemini (κ = 0.379, p < 0.001). Laboratory markers such as lactate, hemoglobin, and procalcitonin significantly differed between discharged and non-discharged patients.

Conclusion

Large language models may support ICU discharge decisions when guided by structured, guideline-informed prompting. ChatGPT achieved higher overall accuracy, sensitivity, and F1 score, whereas Gemini demonstrated higher specificity.

Trial registration

Externation (Discharge) of ICU, NCT06584890, registered 03 September 2024, prospectively registered, https://register.clinicaltrials.gov/prs/beta/studies/S000EVXZ00000029/recordSummary