An XGBoost predictive survival model using real-world EHR data predicted recurrence-free survival in early breast cancer with a C-index of 0.76.
Observational (n=158,111)
Yes
A machine learning model using real-world data successfully predicted recurrence-free survival in patients with early breast cancer, identifying distinct risk factors for early-onset disease.
Effect estimate: C-index 0.76
547 Background: Accurate risk stratification in early breast cancer (eBC) is critical to guide adjuvant treatment intensity. In HR+/HER2- stage I-III disease, clinicians use clinical and genomic information to estimate recurrence risk, yet a comprehensive, real-world predictive model that synthesizes these inputs after curative-intent surgery is not established in routine practice. We developed a multimodal predictive survival model in a large eBC cohort to predict recurrence-free survival (RFS) after surgery and derive risk groups to inform treatment escalation or de-escalation. Methods: This study used the US-based EHR-derived deidentified Flatiron Health Research Database (data cutoff: Sep 30, 2025). The cohort consisted of patients (pts) diagnosed with HR+/HER2- stage I-III eBC between Jan 1, 2016, and Jan 1, 2023, who received surgery and no neoadjuvant therapy. An Extreme Gradient Boosting (XGBoost) model was developed using 10-fold cross-validation, with 20% of the data reserved for testing, to predict time from surgery to recurrence or death. SHapley Additive exPlanations (SHAP) analysis was used to rank feature importance. Pts were classified into 3 risk groups by prediction percentile, and RFS by group was plotted in the test set using the Kaplan-Meier method. A second XGBoost model identified predictors specific to pts diagnosed with early onset (EO) eBC (age ≤ 45 yrs). Results: 158,111 pts qualified for the cohort, and the model C-index was 0.76. Top features that contributed to increased predicted hazard included age, higher tumor grade, higher stage, higher OncotypeDx score, longer time from diagnosis to surgery, ECOG score ≥2, and smoking history. Higher socioeconomic status (SES) decreased predicted hazard. Age had a non-linear association with predicted hazard; risk was elevated among pts 71, with lowest prediction occurring at ages 46-51. The median RFS in the high-risk group was 8.5 yrs (IQR, 8.2-8.8) and was not reached in the medium and low-risk groups.
In the early onset subgroup of 12,196 pts, the C-index was 0.71. Top features in the EO model that contributed to increased hazard also included higher tumor grade and stage, and younger age. Importance of race and Ki67 percent staining (PS) superseded that of ECOG and SES in the EO model. Pts of Black or African American race and pts with Ki67 PS ≥20% had higher predicted hazard. Conclusions: This analysis elucidated key predictors of RFS following surgery in pts with eBC, specifically in non-neoadjuvant–treated pts at higher risk for recurrence, and illustrated predictors that vary in younger pts. Predictors from this large, representative dataset are aligned with known associations with recurrence risk. These findings suggest real-world data models may complement established tools guiding adjuvant treatment intensity, though external validation and prospective evaluation are needed.
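The two evaluation steps named in the Methods, the C-index and the split into 3 risk groups by prediction percentile, can be illustrated with a minimal sketch. This is not the authors' code: the function names `harrell_c_index` and `assign_risk_groups` are hypothetical, the toy data are invented for illustration, ties in follow-up time are simply skipped, and equal-sized tertiles are an assumption because the abstract does not state the percentile cutoffs.

```python
from itertools import combinations


def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored data (simplified).

    times: follow-up time per patient (e.g. years from surgery)
    events: 1 if recurrence or death was observed, 0 if censored
    risk_scores: model output, higher = higher predicted hazard
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only when the shorter time is an observed
        # event; pairs with tied times are skipped in this sketch.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # earlier event had the higher predicted risk
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def assign_risk_groups(risk_scores, labels=("low", "medium", "high")):
    """Split patients into equal-sized groups by rank of predicted risk.

    Assumption: tertiles; the abstract does not give the actual cutoffs.
    """
    n = len(risk_scores)
    order = sorted(range(n), key=lambda k: risk_scores[k])
    groups = [None] * n
    for rank, idx in enumerate(order):
        groups[idx] = labels[rank * len(labels) // n]
    return groups


if __name__ == "__main__":
    # Invented toy data: six patients, three of them censored.
    times = [2.0, 4.0, 5.0, 7.0, 8.0, 9.0]
    events = [1, 1, 0, 1, 0, 0]
    scores = [0.9, 0.7, 0.8, 0.3, 0.2, 0.4]
    print(round(harrell_c_index(times, events, scores), 3))
    print(assign_risk_groups(scores))
```

A C-index of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives a scale for reading the reported values of 0.76 and 0.71. In practice a maintained survival-analysis library would usually be preferred over a hand-rolled pairwise loop, which is quadratic in cohort size.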
Bouzit et al. (Wed,) conducted an observational study in HR+/HER2- stage I-III early breast cancer (n=158,111). An XGBoost predictive survival model was evaluated on recurrence-free survival (time from surgery to recurrence or death) (C-index 0.76).
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: