What is the clinical evidence from this study?

Study design: Systematic review and meta-analysis. Population: Studies of machine learning-based models predicting acute deterioration in ICU patients (n=355 studies). Intervention: External validation of machine learning-based scoring systems vs. internal validation. Primary outcome: Change in area under the receiver operating characteristic curve (AUROC) attributable to external validation (MD -0.037, 95% CI -0.064 to -0.017, p<0.001).

October 12, 2023 · Open Access

Generalisability of AI-based scoring systems in the ICU: a systematic review and meta-analysis

Q: What are the key findings of this study?

On average, machine learning-based ICU scoring systems had an AUROC 0.037 lower in external validation than in internal validation, indicating lower performance at new hospitals.

Study Design

Type

Systematic review and meta-analysis (n=355 studies)

Structured PICO

Population

355 studies developing machine learning-based models to predict acute deterioration in adult ICU patients using routine electronic health record data, of which 39 were externally validated.

Exposure

External validation of machine learning-based scoring systems at geographically distinct hospitals.

Comparator

Internal validation performance (performance in the derivation cohort).

Outcome

Change in area under the receiver operating characteristic curve (AUROC) attributable to external validation.

External validation of machine learning-based ICU scoring systems is uncommon and typically reveals significantly lower performance than internal validation, highlighting the critical need for rigorous external testing before clinical implementation.
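As a quick check on how uncommon external validation was, the two counts given under Population can be turned into a proportion. This is only arithmetic on the reported numbers; the variable names are illustrative.

```python
# Counts reported in the review summary above.
total_studies = 355          # studies included in the review
externally_validated = 39    # studies that were externally validated

# Share of included studies with any external validation.
share_validated = externally_validated / total_studies
print(f"externally validated: {share_validated:.1%}")
```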

Main Result

Mean Difference: -0.037 (95% CI -0.064 to -0.017)

p-value: <0.001
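To make the outcome measure concrete, the sketch below computes a per-model change in AUROC (external minus internal) and two simple summaries: the mean change and the share of models whose AUROC dropped by more than 0.05. The AUROC pairs are hypothetical values invented for illustration, not data from the review, and the unweighted mean is a simplification that stands in for the linear mixed-effects models the authors used. The function names are likewise illustrative.

```python
from statistics import mean

# HYPOTHETICAL (internal AUROC, external AUROC) pairs, one per model.
# These are made-up numbers, not results from the review.
hypothetical_pairs = [
    (0.90, 0.84),
    (0.80, 0.80),
    (0.75, 0.63),
    (0.85, 0.87),
]

def auroc_changes(pairs):
    """Return external minus internal AUROC for each model."""
    return [external - internal for internal, external in pairs]

def summarise(pairs, threshold=0.05):
    """Return the unweighted mean change and the share of models
    whose AUROC dropped by more than `threshold`."""
    changes = auroc_changes(pairs)
    large_drops = sum(1 for change in changes if change < -threshold)
    return mean(changes), large_drops / len(changes)

mean_change, share_large_drop = summarise(hypothetical_pairs)
print(f"mean change in AUROC: {mean_change:+.3f}")
print(f"share with drop > 0.05: {share_large_drop:.0%}")
```

A negative mean change corresponds to lower performance in external data. A mixed-effects model would additionally account for clustering, for example several validations of the same model, which this plain average ignores.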

Limitations

Did not capture validation performed by prospectively collecting additional data or within clinical trials
Did not perform a risk of bias assessment
Assumes no systematic differences between studies that did and did not get externally validated
Overreliance on a few open-source datasets (MIMIC, eICU) for external validation
Exclusive use of AUROC, which may be less sensitive to changes in data than other metrics

Abstract

Background

Machine learning (ML) is increasingly used to predict clinical deterioration in intensive care unit (ICU) patients through scoring systems. Although promising, such algorithms often overfit their training cohort and perform worse at new hospitals. Thus, external validation is a critical – but frequently overlooked – step to establish the reliability of predicted risk scores to translate them into clinical practice. We systematically reviewed how regularly external validation of ML-based risk scores is performed and how their performance changed in external data.

Methods

We searched MEDLINE, Web of Science, and arXiv for studies using ML to predict deterioration of ICU patients from routine data. We included primary research published in English before April 2022. We summarised how many studies were externally validated, assessing differences over time, by outcome, and by data source. For validated studies, we evaluated the change in area under the receiver operating characteristic (AUROC) attributable to external validation using linear mixed-effects models.

Results

We included 355 studies, of which 39 (11.0%) were externally validated, increasing to 17.9% by 2022. Validated studies made disproportionate use of open-source data, with two well-known US datasets (MIMIC and eICU) accounting for 79.5% of studies. On average, AUROC was reduced by -0.037 (95% CI -0.064 to -0.017) in external data, with >0.05 reduction in 38.6% of studies.

Discussion

External validation, although increasing, remains uncommon. Performance was generally lower in external data, questioning the reliability of some recently proposed ML-based scores. Interpretation of the results was challenged by an overreliance on the same few datasets, implicit differences in case mix, and exclusive use of AUROC.

Cite This Study

Rockenschaub et al. (2023) conducted a systematic review and meta-analysis of studies predicting acute deterioration in ICU patients (n=355 studies). External validation of machine learning-based scoring systems was compared with internal validation on the change in area under the receiver operating characteristic curve (AUROC) attributable to external validation (MD -0.037, 95% CI -0.064 to -0.017, p<0.001). On average, AUROC was 0.037 lower in external validation than in internal validation, indicating lower performance at new hospitals.

synapsesocial.com/papers/6a23a337a9ac004fba9edd73 https://doi.org/10.1101/2023.10.11.23296733
