Do machine learning models derived from EHR data improve the prediction of ischemic stroke compared to the CHA2DS2-VASc score in adults with atrial fibrillation?
While machine learning models show potential to improve ischemic stroke prediction in atrial fibrillation compared to CHA2DS2-VASc, current evidence is limited by pervasive methodological flaws and high risk of bias, precluding clinical adoption.
BACKGROUND Atrial fibrillation (AF) is the most common sustained cardiac arrhythmia and confers a four- to fivefold increase in ischemic stroke risk, accounting for approximately 15 to 20% of all stroke events globally. Despite this burden, the predominant risk stratification tool, the CHA2DS2-VASc score, achieves only modest discrimination, constrained by its static, additive architecture that cannot capture the nonlinear, high-dimensional interactions inherent in real-world electronic health record (EHR) data. This evidence gap creates a dual clinical hazard: under-anticoagulation in high-risk patients and unnecessary bleeding exposure in those whose risk is overestimated. This study aimed to systematically evaluate the predictive performance, methodological rigor, and clinical readiness of machine learning (ML) models derived from EHR data for the prediction of ischemic stroke in patients with AF.
METHODS A systematic search of PubMed, Embase, Scopus, and Web of Science was conducted from inception through September 2025, following PRISMA 2020 guidelines. Studies were eligible if they developed or validated ML models for ischemic stroke prediction using EHR data in adults with AF and reported at least one quantitative performance metric. Methodological quality was assessed using the PROBAST and TRIPOD-AI frameworks.
RESULTS Eight studies (2017 to 2024) encompassing 809,523 patients across seven countries were included. Supervised ensemble methods consistently outperformed CHA2DS2-VASc, with AUROCs ranging from 0.66 to 0.91 versus 0.54 to 0.68 for the traditional score. However, performance varied substantially: several models achieved only marginal gains (AUROC 0.63 to 0.69), and the AUROC range reflects pronounced heterogeneity rather than uniform superiority. Critical barriers persist: only one study performed external validation; fewer than half applied explainable AI techniques; class imbalance was rarely addressed; and 88% of studies received a high risk of bias rating in the analysis domain under PROBAST, a finding that substantially limits confidence in the reported performance estimates.
CONCLUSION In light of the pervasive methodological limitations identified, including high analytic risk of bias, absence of external validation, and lack of model interpretability, claims of ML superiority over CHA2DS2-VASc must be interpreted with caution. While ML models demonstrate potential discriminative improvements, current evidence is insufficient to support clinical adoption. Translating algorithmic promise into bedside impact requires dynamic longitudinal modeling, rigorous multisite external validation, transparent risk attribution, and prospective evaluation within real-world EHR workflows.
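To make the abstract's contrast concrete, the following is a minimal Python sketch of the two quantities it compares: the additive CHA2DS2-VASc score and the AUROC used to measure discrimination. The point weights follow the standard published definition of the score. The `Patient` class, its field names, and the `auroc` helper are hypothetical illustrations and are not taken from the review or from any of the included studies.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Patient:
    """Hypothetical container for the CHA2DS2-VASc inputs (names are assumptions)."""
    age: int
    female: bool
    heart_failure: bool = False
    hypertension: bool = False
    diabetes: bool = False
    prior_stroke_tia_te: bool = False  # prior stroke, TIA, or thromboembolism
    vascular_disease: bool = False


def cha2ds2_vasc(p: Patient) -> int:
    """Sum the fixed point weights; no interactions, no time dependence."""
    score = 0
    score += 1 if p.heart_failure else 0        # C: congestive heart failure
    score += 1 if p.hypertension else 0         # H: hypertension
    if p.age >= 75:                             # A2: age 75 or older
        score += 2
    elif p.age >= 65:                           # A: age 65 to 74
        score += 1
    score += 1 if p.diabetes else 0             # D: diabetes mellitus
    score += 2 if p.prior_stroke_tia_te else 0  # S2: stroke/TIA/thromboembolism
    score += 1 if p.vascular_disease else 0     # VA: vascular disease
    score += 1 if p.female else 0               # Sc: sex category (female)
    return score


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random case outranks a random non-case (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one case and one non-case")
    wins = sum(
        1.0 if sp > sn else 0.5 if sp == sn else 0.0
        for sp in pos
        for sn in neg
    )
    return wins / (len(pos) * len(neg))
```

Because the score can only take the integer values 0 through 9, many patients share the same value, and tied pairs contribute only 0.5 each to the AUROC. This is one plausible reading of why a coarse additive score shows modest discrimination; it is an illustration of the mechanism, not a result from the review.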
Islam et al. (Fri,) studied this question.
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: