What question did this study set out to answer?

June 4, 2026 · Open Access

Comparative performance of ChatGPT and DeepSeek in interpreting the 2025 ESICM guidelines on sepsis fluid therapy

Key Points

This study aims to compare the performance of ChatGPT and DeepSeek in interpreting the 2025 ESICM guidelines on sepsis fluid therapy.
Prospective, cross-sectional observational study with nine clinical questions derived from guidelines.
Responses generated in English and Chinese from ChatGPT-5, ChatGPT-4o, and DeepSeek-V3.1, assessed by three blinded intensivists.
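Accuracy, completeness, and consistency rated on 5-point Likert scales; readability assessed with Flesch Reading Ease and Flesch–Kincaid Grade Level for English and a validated readability framework for Chinese.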
ChatGPT-5 achieved the highest accuracy in English responses, but no statistically significant inter-model differences were observed.
In Chinese, ChatGPT-5 showed significantly higher accuracy than ChatGPT-4o (p < 0.05).
DeepSeek-V3.1 provided significantly more complete responses than ChatGPT-4o in English (p < 0.05).

Abstract

Background Fluid therapy is central to sepsis management, yet recommendations on fluid type, volume, optimization, and de-escalation remain uncertain. The 2025 ESICM guidelines highlight major evidence gaps in sepsis fluid therapy. Although large language models (LLMs) show promise for guideline interpretation and clinical decision support, their performance in this high-risk domain is unclear.
Methods We conducted a prospective, cross-sectional observational study using nine guideline-derived sepsis-related clinical questions addressing fluid selection, resuscitation volume, and fluid removal during de-escalation. Questions were queried in both English and Chinese across three consecutive days, generating three independent responses per model from ChatGPT-5, ChatGPT-4o, and DeepSeek-V3.1. Three blinded intensivists evaluated responses for accuracy, completeness, and consistency using 5-point Likert scales. Readability was assessed using Flesch Reading Ease (FRE) and Flesch–Kincaid Grade Level (FKGL) for English responses and a validated Chinese readability framework. Inter-rater agreement was quantified using Kendall’s W coefficient.
Results In English responses, ChatGPT-5 achieved the highest accuracy, although inter-model differences were not statistically significant. In Chinese responses, ChatGPT-5 demonstrated significantly higher accuracy than ChatGPT-4o (p < 0.05). DeepSeek-V3.1 produced significantly more complete English responses than ChatGPT-4o (p < 0.05). Consistency was high across all models and languages. FKGL scores differed significantly among models (p < 0.01), with ChatGPT-5 generating more linguistically complex English text. No significant differences were observed between English and Chinese responses across evaluation dimensions.
Conclusions Advanced LLMs show potential for supporting sepsis fluid therapy guideline interpretation, but consistent overconfident responses in guideline-defined uncertainty domains highlight important safety limitations. Clinical oversight remains essential when deploying LLMs for high-risk decision support.
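
For readers unfamiliar with the metrics named in the Methods, the following is a minimal sketch of the standard textbook formulas for FRE, FKGL, and Kendall's W. It is not the authors' analysis code. The function names are illustrative, the readability functions assume word, sentence, and syllable counts have already been obtained by some other tool, and the tie correction in Kendall's W is an assumption about how tied Likert scores would reasonably be handled.

```python
from collections import Counter


def flesch_reading_ease(words, sentences, syllables):
    """Standard FRE formula; higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Standard FKGL formula; approximates a US school grade level."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def average_ranks(scores):
    """Rank scores 1..n, giving tied scores the mean of the ranks they span."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def kendalls_w(ratings):
    """Kendall's W with tie correction.

    ratings[j][i] is the score that rater j gave to item i.
    Returns a value in [0, 1]; 1 means the raters rank items identically.
    """
    m = len(ratings)      # number of raters
    n = len(ratings[0])   # number of rated items
    ranked = [average_ranks(r) for r in ratings]
    rank_sums = [sum(r[i] for r in ranked) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((x - mean_sum) ** 2 for x in rank_sums)
    # Tie term: sum of (t^3 - t) over each group of t tied scores per rater.
    ties = sum(t ** 3 - t for r in ratings for t in Counter(r).values())
    denom = m ** 2 * (n ** 3 - n) - m * ties
    if denom == 0:
        raise ValueError("W is undefined when every rater scores all items equally")
    return 12 * s / denom
```

With three raters who order a set of responses identically, `kendalls_w` returns 1.0; two raters who order three responses in exactly opposite ways yield 0.0. Because 5-point Likert scores produce many ties, an uncorrected W would understate agreement, which is why the sketch includes the tie term.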

Cite This Study

Cheng et al. studied this question.

synapsesocial.com/papers/6a2117dfd499ed480b170c20 https://doi.org/10.1177/20552076261458159
