Abstract

Rationale
Artificial intelligence (AI)-based clinical decision support systems (CDSSs) show promise in guiding care for patients with critical illness but lack guarantees about the safety of their recommendations. We hypothesized that large language models (LLMs), informed by evidence-based guidelines, could provide feedback to a previously trained AI-CDSS for invasive mechanical ventilation (IMV) management.

Methods
We evaluated the ventilator settings recommended by a previously developed AI-CDSS. We presented clinical data (e.g., ventilator settings, observed respiratory rate, and pH) observed during the first 24 hours of IMV for patients receiving assist control-volume cycled ventilation to GPT-4o (OpenAI) and Llama 4 Maverick (Meta). Each LLM was also presented with a recommended set tidal volume derived from the AI-CDSS. A prompt asked each LLM for an updated recommendation that aligned treatment with guideline-recommended lung protective ventilation strategies, including using 6 mL/kg predicted body weight (PBW) and adjusting the set respiratory rate accordingly. The treatment recommendations were compared using Wilcoxon signed-rank tests.

Results
Among 11,228 patients, 6,442 (57%) were men, 5,073 (45%) were white, and the median age was 62 years (interquartile range [IQR] 50 to 72). The median observed set tidal volume was 6.9 mL/kg PBW (IQR 6.1 to 7.6). The AI-CDSS recommended a median of 6.8 mL/kg PBW (IQR 6.1 to 7.5). GPT-4o recommended a median of 6.0 mL/kg PBW (IQR 5.8 to 6.3), a median reduction of 0.8 mL/kg (IQR 0.1 to 1.4). Llama 4 recommended a median of 6.0 mL/kg PBW (IQR 6.0 to 6.0), a median reduction of 0.9 mL/kg (IQR 0.2 to 1.7). For patients with a low pH (7.2), GPT-4o and Llama 4 recommended increasing the set tidal volume by a median of 0.15 mL/kg and 5.3 mL/kg, respectively. The median observed minute ventilation was 9.0 liters per minute (L/min) (IQR 7.6 to 11.2). The AI-CDSS recommended a median of 9.2 L/min (IQR 7.7 to 11.0). GPT-4o recommended a median of 8.5 L/min (IQR 7.0 to 10.4), a median reduction of 0.5 L/min (IQR -0.0 to 1.3). Llama 4 recommended a median of 9.2 L/min (IQR 7.6 to 11.2), a median reduction of 0.1 L/min (IQR -0.6 to 0.3).

Conclusions
LLMs such as GPT-4o and Llama 4 recommended reasonable tidal volume distributions on average but diverged in their tidal volume strategies in response to acidemia, while still preserving overall minute ventilation. Additional validation and outlier analyses are needed to determine the most effective and safe uses for LLMs in clinical care.

This abstract is funded by: None
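The arithmetic behind "6 mL/kg PBW with the respiratory rate adjusted accordingly" can be illustrated with a minimal sketch. It is not the method used by the AI-CDSS or the LLMs in this study. It assumes the widely used ARDSNet predicted body weight formula (the abstract does not state which formula was applied), and the function names and the `max_rr` cap of 35 breaths/min are hypothetical choices made for illustration only.

```python
def predicted_body_weight_kg(height_cm: float, sex: str) -> float:
    """ARDSNet PBW formula (an assumption; the abstract does not name one)."""
    if sex not in ("male", "female"):
        raise ValueError("sex must be 'male' or 'female'")
    base = 50.0 if sex == "male" else 45.5
    return base + 0.91 * (height_cm - 152.4)


def minute_ventilation_l_min(tidal_volume_ml: float, resp_rate: float) -> float:
    """Minute ventilation (L/min) = tidal volume (mL) x rate (breaths/min) / 1000."""
    return tidal_volume_ml * resp_rate / 1000.0


def lung_protective_settings(height_cm: float, sex: str,
                             current_tv_ml: float, current_rr: float,
                             target_ml_per_kg: float = 6.0,
                             max_rr: float = 35.0) -> dict:
    """Set tidal volume to the target mL/kg PBW and rescale the rate so that
    minute ventilation is preserved, up to a hypothetical rate cap."""
    pbw = predicted_body_weight_kg(height_cm, sex)
    new_tv_ml = target_ml_per_kg * pbw
    current_mv = minute_ventilation_l_min(current_tv_ml, current_rr)
    # Rate needed to deliver the same minute ventilation at the new tidal volume.
    new_rr = min(current_mv * 1000.0 / new_tv_ml, max_rr)
    return {"pbw_kg": pbw, "tidal_volume_ml": new_tv_ml, "resp_rate": new_rr}


if __name__ == "__main__":
    # Illustrative values only, not patient data from the study.
    print(lung_protective_settings(175, "male", current_tv_ml=490, current_rr=18))
```

Because the rate is rescaled in inverse proportion to the tidal volume, minute ventilation is unchanged unless the rate cap is reached. Lowering tidal volume while minute ventilation stays roughly constant is the pattern the Results describe for the overall cohort.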
Schmid et al. (Fri,) studied this question.