Background Fluid therapy is central to sepsis management, yet recommendations on fluid type, volume, optimization, and de-escalation remain uncertain. The 2025 ESICM guidelines highlight major evidence gaps in sepsis fluid therapy. Although large language models (LLMs) show promise for guideline interpretation and clinical decision support, their performance in this high-risk domain is unclear.
Methods We conducted a prospective, cross-sectional observational study using nine guideline-derived sepsis-related clinical questions addressing fluid selection, resuscitation volume, and fluid removal during de-escalation. Each question was posed in both English and Chinese on three consecutive days, generating three independent responses per model from ChatGPT-5, ChatGPT-4o, and DeepSeek-V3.1. Three blinded intensivists evaluated responses for accuracy, completeness, and consistency using 5-point Likert scales. Readability was assessed using Flesch Reading Ease (FRE) and Flesch–Kincaid Grade Level (FKGL) for English responses and a validated Chinese readability framework for Chinese responses. Inter-rater agreement was quantified using Kendall's W coefficient.
Results In English responses, ChatGPT-5 achieved the highest accuracy, although inter-model differences were not statistically significant. In Chinese responses, ChatGPT-5 demonstrated significantly higher accuracy than ChatGPT-4o (p < 0.05). DeepSeek-V3.1 produced significantly more complete English responses than ChatGPT-4o (p < 0.05). Consistency was high across all models and languages. FKGL scores differed significantly among models (p < 0.01), with ChatGPT-5 generating more linguistically complex English text. No significant differences were observed between English and Chinese responses across evaluation dimensions.
Conclusions Advanced LLMs show potential for supporting the interpretation of sepsis fluid therapy guidelines, but consistently overconfident responses in domains that the guidelines define as uncertain highlight important safety limitations. Clinical oversight remains essential when deploying LLMs for high-risk decision support.
Cheng et al. (Sun,) studied this question.
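The Methods name three statistics (FRE, FKGL, and Kendall's W) without stating how they were computed. The following Python sketch shows the standard textbook formulas for illustration only; it is not the authors' analysis code. The function names are hypothetical, the readability functions assume that word, sentence, and syllable counts have already been obtained, and the tie correction in Kendall's W is an assumption that seems appropriate for 5-point Likert scores, where tied ratings are common.

```python
from collections import Counter


def flesch_reading_ease(words, sentences, syllables):
    """Standard FRE formula; higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Standard FKGL formula; higher scores indicate more complex text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def average_ranks(scores):
    """Rank scores from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def kendalls_w(ratings):
    """Kendall's coefficient of concordance with tie correction.

    ratings: one list per rater, each holding that rater's scores
    for the same n items in the same order.
    """
    m, n = len(ratings), len(ratings[0])
    ranked = [average_ranks(r) for r in ratings]
    rank_sums = [sum(r[i] for r in ranked) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((x - mean_sum) ** 2 for x in rank_sums)
    # Tie term: sum of (t^3 - t) over every group of tied scores per rater.
    ties = sum(t**3 - t for r in ratings for t in Counter(r).values())
    denom = m**2 * (n**3 - n) - m * ties
    if denom == 0:
        raise ValueError("W is undefined when every rater gives all items the same score")
    return 12 * s / denom
```

W ranges from 0 (no agreement among raters) to 1 (identical rankings). For example, `kendalls_w([[5, 4, 4, 3]] * 3)` describes three raters who scored four responses identically, and `kendalls_w([[1, 2, 3], [3, 2, 1]])` describes two raters whose rankings are exactly reversed.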