Artificial intelligence (AI) has the potential to support clinicians in high-risk, complex decisions such as setting up mechanical ventilation. This study aimed to compare the mechanical ventilator settings determined by the emergency physician (EP) with recommendations generated by three large language models (ChatGPT-5, Gemini, and Copilot) in the emergency department (ED).

This prospective, observational, analytical, single-center study included 30 intubated patients managed in an ED over a three-month period. Clinical data, including diagnoses, vital signs, and initial arterial blood gas parameters, were presented to ChatGPT-5, Gemini, and Copilot. The AI models’ recommendations for ventilation mode, tidal volume, respiratory rate, positive end-expiratory pressure (PEEP), and fraction of inspired oxygen (FiO₂) were compared with the initial settings adjusted by the EP. Agreement for ventilator mode selection was assessed using Cohen’s kappa, while agreement for the continuous ventilator parameters was evaluated using Bland–Altman analysis.

The median age of the 30 patients was 73 years (IQR: 60–84), and 66.7% were male. The modes most commonly chosen by the EP were volume-controlled ventilation (VCV; 46.7%) and synchronized intermittent mandatory ventilation (SIMV; 40.0%). Among the AI models, ChatGPT-5 primarily recommended VCV (76.7%) and, to a lesser extent, continuous positive airway pressure (CPAP; 10.0%); Gemini recommended VCV (56.7%) and pressure-controlled ventilation (PCV; 43.3%); and Copilot predominantly recommended PCV (70.0%). For ventilator mode selection, the AI models’ recommendations, made on the basis of diagnosis, showed ‘poor’ agreement with the EP, whose choices served as the expert opinion. ChatGPT-5 agreed with the EP in 50.0% of cases (Cohen’s kappa: 0.199; 95% confidence interval (CI): −0.087 to 0.486), Gemini in 43.3% (Cohen’s kappa: 0.164; 95% CI: −0.098 to 0.426), and Copilot in 20.0% (Cohen’s kappa: −0.043; 95% CI: −0.230 to 0.143); the 95% CI includes zero for all three models. Agreement between the AI-generated ventilator settings and those of the EP was limited.
Current AI models may offer supportive input; however, these findings should be interpreted as preliminary and exploratory, and large-scale, multicenter studies are needed to validate them.
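The two agreement measures named in the methods can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the function names `cohens_kappa` and `bland_altman` are hypothetical, the example values are invented for illustration and are not study data, and the Bland–Altman limits assume the conventional mean difference ± 1.96 standard deviations. Confidence intervals for kappa are not computed here.

```python
from collections import Counter
from statistics import mean, stdev


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labelling the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of cases")
    n = len(rater_a)
    # Observed agreement: share of cases where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal proportions, summed over labels.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one and the same label for every case
    return (observed - expected) / (1.0 - expected)


def bland_altman(method_a, method_b):
    """Return (bias, lower limit, upper limit) of agreement for paired measurements."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = mean(diffs)                # mean difference between the two methods
    spread = 1.96 * stdev(diffs)      # assumes roughly normal differences
    return bias, bias - spread, bias + spread


# Invented illustration data (NOT from the study): six cases, EP versus one AI model.
ep_modes = ["VCV", "SIMV", "VCV", "PCV", "SIMV", "VCV"]
ai_modes = ["VCV", "VCV", "VCV", "PCV", "PCV", "VCV"]
ep_peep = [5, 6, 8, 5, 10, 7]   # cmH2O
ai_peep = [5, 5, 8, 6, 8, 8]    # cmH2O

kappa = cohens_kappa(ep_modes, ai_modes)
bias, lower, upper = bland_altman(ep_peep, ai_peep)
print(f"kappa = {kappa:.3f}")
print(f"PEEP bias = {bias:.2f} cmH2O, limits of agreement = {lower:.2f} to {upper:.2f}")
```

Because kappa corrects raw percent agreement for the agreement expected by chance, a model can match the EP in half of the cases and still obtain a kappa close to zero when both raters concentrate on the same few modes.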
ALTUNTAŞ et al. (Thu,) studied this question.