Accurate triage in emergency departments (EDs) is essential for timely care and efficient resource utilization, particularly in high-acuity settings. Machine learning (ML) models have the potential to enhance risk stratification; however, their performance relative to established manual triage systems remains uncertain. This prospective single-center study included 899 adult patients transported by emergency medical services (EMS) and triaged to the ED red zone. Three ML models (Random Forest, Gradient Boosting Machine, and CatBoost) were developed using routinely available clinical variables. The dataset was split into training (70%) and test (30%) sets. Model performance was evaluated using area under the receiver operating characteristic curve (AUROC), area under the precision–recall curve (AUPRC), Brier score, and classification metrics. DeLong’s test was used for statistical comparisons of AUROC. Model interpretability was assessed using SHAP (Shapley Additive Explanations) analysis. Adverse clinical outcomes (intensive care unit admission or in-hospital mortality) occurred in 9.1% of patients. All manual triage systems (Emergency Severity Index, Manchester Triage System, Turkish Ministry of Health Triage System) were significantly associated with adverse outcomes (p < […] > 0.80) and showed comparable or numerically higher performance across multiple metrics, including AUPRC and Brier score. However, no statistically significant differences were observed between ML models and manual triage systems (all p > 0.05). ML models also demonstrated comparable or improved calibration, particularly in low-prevalence settings. SHAP analysis identified age and key physiological parameters as the most influential predictors. ML models demonstrated robust and consistent performance in predicting adverse outcomes among high-acuity EMS patients, with performance broadly comparable to established manual triage systems.
These findings support the potential role of ML-based approaches as complementary decision support tools in ED triage, warranting further validation in multicenter settings.
Selvi et al. studied this question.
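The three probabilistic metrics named in the abstract (AUROC, AUPRC, and Brier score) can be illustrated with a minimal, standard-library sketch. This is not the authors' code: the function names and the ten-patient toy data are hypothetical, and the AUPRC is estimated here with the step-wise average-precision formula, which may differ from the estimator used in the study. In practice, scikit-learn's `roc_auc_score`, `average_precision_score`, and `brier_score_loss` cover the same ground.

```python
from typing import Sequence


def auroc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """AUROC as the probability that a random positive case is scored
    above a random negative case (ties count as one half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Step-wise AUPRC estimate: mean precision at the rank of each positive.
    Assumes no tied scores (ties would be broken by input order)."""
    order = sorted(range(len(y_true)), key=lambda i: -y_prob[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    if not precisions:
        raise ValueError("need at least one positive case")
    return sum(precisions) / len(precisions)


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and outcome
    (lower is better; reflects calibration as well as discrimination)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical toy data: 10 patients, 2 adverse outcomes (not study data).
y_true = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
y_prob = [0.05, 0.10, 0.80, 0.20, 0.02, 0.30, 0.25, 0.15, 0.08, 0.03]

print(f"AUROC: {auroc(y_true, y_prob):.4f}")
print(f"AUPRC: {average_precision(y_true, y_prob):.4f}")
print(f"Brier: {brier_score(y_true, y_prob):.4f}")
```

With a low event rate such as the 9.1% reported above, AUPRC and the Brier score are informative alongside AUROC, because AUROC alone can look favorable even when few of the flagged patients actually experience the outcome.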