What question did this study set out to answer?

This study aims to assess the effectiveness of large language models in improving AI-derived ventilator management recommendations.

May 20, 2026

C56-38 Using Large Language Models to Align an Artificial Intelligence Decision Support System With Safe Ventilator Management Practices

Key Points

This study aims to assess the effectiveness of large language models in improving AI-derived ventilator management recommendations.
The study evaluated ventilator settings recommended by an AI-CDSS, using clinical data from the first 24 hours of invasive mechanical ventilation (IMV).
LLMs GPT-4o and Llama 4 provided updated recommendations based on clinical data and guidelines.
Results were compared using Wilcoxon signed-rank tests.
Among 11,228 patients, both LLMs recommended a median tidal volume of 6.0 mL/kg PBW, a median reduction from the AI-CDSS recommendation of 0.8 mL/kg for GPT-4o and 0.9 mL/kg for Llama 4.
For patients with low pH, the recommendations diverged: GPT-4o recommended increasing the set tidal volume by a median of 0.15 mL/kg and Llama 4 by a median of 5.3 mL/kg.
Overall, LLMs preserved minute ventilation while offering varied tidal volume strategies.
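
The link between a lower tidal volume and a preserved minute ventilation is simple arithmetic: minute ventilation is tidal volume times respiratory rate, so a smaller tidal volume needs a proportionally higher rate. The sketch below illustrates that relationship only; it is not the study's method or the AI-CDSS logic. The function names and the example patient are hypothetical, and the predicted body weight formula is the widely used ARDSNet one, which the abstract does not name explicitly.

```python
def predicted_body_weight_kg(height_cm: float, sex: str) -> float:
    """ARDSNet predicted body weight (assumed formula, not stated in the abstract)."""
    if sex not in ("male", "female"):
        raise ValueError("sex must be 'male' or 'female'")
    base = 50.0 if sex == "male" else 45.5
    return base + 0.91 * (height_cm - 152.4)


def minute_ventilation_l_per_min(vt_ml_per_kg: float, pbw_kg: float, rate: float) -> float:
    """Minute ventilation = tidal volume (mL) x respiratory rate (breaths/min), in L/min."""
    return vt_ml_per_kg * pbw_kg * rate / 1000.0


def rate_preserving_minute_ventilation(vt_old: float, rate_old: float, vt_new: float) -> float:
    """Respiratory rate that keeps minute ventilation unchanged after a tidal volume change."""
    if vt_new <= 0:
        raise ValueError("new tidal volume must be positive")
    return rate_old * vt_old / vt_new


# Hypothetical patient: 175 cm male, currently 6.8 mL/kg PBW at 18 breaths/min.
pbw = predicted_body_weight_kg(175.0, "male")
new_rate = rate_preserving_minute_ventilation(6.8, 18.0, 6.0)
before = minute_ventilation_l_per_min(6.8, pbw, 18.0)
after = minute_ventilation_l_per_min(6.0, pbw, new_rate)
print(f"PBW {pbw:.1f} kg, rate {new_rate:.1f}/min, MV {before:.2f} -> {after:.2f} L/min")
```

The computed rate is left unrounded here; a bedside setting would be a whole number chosen by the clinician, which changes minute ventilation slightly.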

Abstract

Rationale

Artificial intelligence (AI)-based clinical decision support systems (CDSSs) show promise in guiding care for patients with critical illness but lack guarantees about the safety of their recommendations. We hypothesized that large language models (LLMs), informed by evidence-based guidelines, could provide feedback to a previously trained AI-CDSS for invasive mechanical ventilation (IMV) management.

Methods

We evaluated the ventilator settings recommended by a previously developed AI-CDSS. We presented clinical data (e.g., ventilator settings, observed respiratory rate, and pH) observed during the first 24 hours of IMV for patients receiving assist control-volume cycled ventilation to GPT-4o (OpenAI) and Llama 4 Maverick (Meta). Each LLM was also presented with a recommended set tidal volume derived from the AI-CDSS. A prompt asked each LLM for an updated recommendation that aligned treatment with guideline-recommended lung protective ventilation strategies, including a tidal volume of 6 mL/kg predicted body weight (PBW), and to adjust the set respiratory rate accordingly. The treatment recommendations were compared using Wilcoxon signed-rank tests.

Results

Among 11,228 patients, 6,442 (57%) were men, 5,073 (45%) were white, and the median age was 62 years (interquartile range [IQR] 50 to 72). The median observed set tidal volume was 6.9 mL/kg PBW (IQR 6.1 to 7.6). The AI-CDSS recommended a median of 6.8 mL/kg PBW (IQR 6.1 to 7.5). GPT-4o recommended a median of 6.0 mL/kg PBW (IQR 5.8 to 6.3), a median reduction of 0.8 mL/kg (IQR 0.1 to 1.4). Llama 4 recommended a median of 6.0 mL/kg PBW (IQR 6.0 to 6.0), a median reduction of 0.9 mL/kg (IQR 0.2 to 1.7). For patients with a low pH (7.2), GPT-4o and Llama 4 recommended increasing the set tidal volume by a median of 0.15 mL/kg and 5.3 mL/kg, respectively.

The median observed minute ventilation was 9.0 liters per minute (L/min) (IQR 7.6 to 11.2). The AI-CDSS recommended a median of 9.2 L/min (IQR 7.7 to 11.0). GPT-4o recommended a median of 8.5 L/min (IQR 7.0 to 10.4), a median reduction of 0.5 L/min (IQR -0.0 to 1.3). Llama 4 recommended a median of 9.2 L/min (IQR 7.6 to 11.2), a median reduction of 0.1 L/min (IQR -0.6 to 0.3).

Conclusions

LLMs such as GPT-4o and Llama 4 provide reasonable distributions of recommended tidal volumes on average, but they adopt divergent tidal volume strategies in response to acidemia while still preserving overall minute ventilation. Additional validation and outlier analyses are needed to determine the most effective and safe uses for LLMs in clinical care.

This abstract is funded by: None
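
The Wilcoxon signed-rank test named in the Methods compares paired values, here the AI-CDSS and LLM recommendation for the same patient, by ranking the absolute paired differences and summing the ranks by sign. The following is a minimal standard-library sketch of that procedure, assuming a two-sided normal approximation with no tie or continuity correction. The abstract does not say which software or variant the authors used, and the sample values are made up for illustration.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Return (w_plus, w_minus, z, two_sided_p) for paired samples x and y.

    Assumptions of this sketch: zero differences are dropped, tied absolute
    differences get average ranks, and the p-value uses the plain normal
    approximation (no tie or continuity correction).
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank absolute differences; ties share the average of their rank positions.
    # Ties are detected by exact equality, so round float inputs beforehand.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return w_plus, w_minus, z, p


# Made-up tidal volumes in tenths of mL/kg PBW (integers avoid float tie issues).
cdss = [68, 72, 61, 75, 66, 70]
llm = [60, 60, 60, 62, 60, 63]
print(wilcoxon_signed_rank(cdss, llm))
```

With a sample as large as the one in this study the normal approximation is the usual choice; for small samples an exact test is more appropriate than this sketch.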
