May 29, 2026

Clinicopathology-based machine learning model for prediction of pathologic complete response to neoadjuvant chemotherapy in breast cancer.

Key Points

The study aimed to develop a machine learning framework that predicts pathologic complete response (pCR) to neoadjuvant chemotherapy in breast cancer from routine clinicopathological data.
Analyzed 298 breast cancer cases from a Turkish cohort (n = 84 pCR vs. n = 214 non-pCR).
Partitioned dataset into training and independent test sets with nested cross-validation for model evaluation.
Used 12 machine learning algorithms including Logistic Regression and XGBoost, while addressing class imbalance and optimizing decision thresholds.
Logistic Regression achieved a ROC-AUC of 0.803 and 88% sensitivity in the test set (n = 60), identifying 15 out of 17 pCR cases.
The model outperformed random baseline estimates for both NPV (93.1% vs. 71%) and PPV (48.4% vs. 26.9%).
HER2 expression was the strongest predictor of pCR, while ER status and AJCC stage were the strongest negative predictors.
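The reported test-set figures can be cross-checked with simple arithmetic. The sketch below is illustrative only. The true-negative and false-positive counts (27 and 16) are not stated in the source; they are inferred here from the reported 63% specificity over the 43 non-pCR test cases. The random baseline is approximated by shuffling the true labels 100 times, which is an assumption about how such a baseline might be built, not the authors' Monte Carlo procedure.

```python
import random

# Counts reconstructed from the reported test-set figures (n = 60, 17 pCR).
# TN and FP are inferred from the reported 63% specificity; they are an
# assumption, not values stated in the abstract.
N_TEST, N_PCR = 60, 17
TP = 15
FN = N_PCR - TP              # 2 missed pCR cases
N_NON_PCR = N_TEST - N_PCR   # 43 non-pCR cases
TN = 27                      # inferred: 27 / 43 is about 63%
FP = N_NON_PCR - TN          # 16


def metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, PPV and NPV as fractions."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def permutation_baseline(y_true, n_iter=100, seed=0):
    """Mean PPV and NPV when predictions are a random shuffle of the labels.

    Hypothetical stand-in for a random baseline; shuffling keeps the number
    of predicted positives equal to the number of true positives.
    """
    rng = random.Random(seed)
    ppv_sum = npv_sum = 0.0
    for _ in range(n_iter):
        y_pred = y_true[:]
        rng.shuffle(y_pred)
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        tn = sum(1 for t, p in pairs if t == 0 and p == 0)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        ppv_sum += tp / (tp + fp)
        npv_sum += tn / (tn + fn)
    return ppv_sum / n_iter, npv_sum / n_iter


model = metrics(TP, FP, TN, FN)
labels = [1] * N_PCR + [0] * N_NON_PCR
baseline_ppv, baseline_npv = permutation_baseline(labels)

print({name: round(value, 3) for name, value in model.items()})
print("baseline PPV, NPV:", round(baseline_ppv, 3), round(baseline_npv, 3))
```

With these inferred counts, the model metrics round to 88.2% sensitivity, 62.8% specificity, 48.4% PPV, and 93.1% NPV, matching the reported values. The shuffled-label baseline should land near the test-set pCR prevalence (17/60) for PPV and near its complement for NPV, which is in the same range as the reported baseline estimates.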

Abstract

e12567

Background: Pathologic complete response (pCR) following neoadjuvant chemotherapy (NAC) is a surrogate for long-term survival in breast cancer. While pCR status informs surgical and therapeutic decisions, accurate patient-level prediction is hindered by tumor heterogeneity and skewed outcome distributions. Standard parameters (T/N stage, molecular subtype, HER2 status, Ki-67) are routinely documented but rarely integrated into robust decision-support tools. We developed an interpretable machine learning (ML) framework to provide calibrated, data-driven predictions of pCR using routine clinicopathological data.

Methods: This retrospective study analyzed a Turkish cohort of 298 breast cancer cases (n = 84 pCR vs. n = 214 non-pCR) using 20 expert-curated variables. To prevent data leakage, the dataset was partitioned into training and independent hold-out test sets, with all preprocessing and feature selection parameters fitted strictly on the training data. Model performance was evaluated using nested cross-validation. We screened 12 ML algorithms, comprising linear models (Logistic Regression, SVM), distance-based models (KNN), and tree-based ensembles (XGBoost, CatBoost, Random Forest), along with meta-ensemble architectures (Voting and Stacking). Optimal model selection was guided by a composite score integrating MCC, PR-AUC, F1-score, and Brier score. Models were benchmarked against a random baseline estimated with a 100-iteration Monte Carlo simulation. Class imbalance was addressed through balanced weighting, and outputs were calibrated using Platt scaling. Decision thresholds were tailored using a weighted Youden’s index to prioritize sensitivity. Interpretability was established through permutation importance and SHAP analysis.

Results: Calibrated Logistic Regression (LR) emerged as the optimal model, achieving a ROC-AUC of 0.803. In the test set (n = 60), LR correctly identified 15 out of 17 pCR cases, yielding 88% sensitivity and 93.1% NPV (63% specificity, 48.4% PPV). These outcomes significantly outperformed random baseline estimates (NPV: 71%, PPV: 26.9%), confirming a substantial predictive gain. Feature analysis identified HER2 expression (0.60) as the primary positive predictor of pCR, whereas ER status (-0.56) and AJCC stage (-0.31) were the strongest negative predictors. Other contributors included nodal status (0.11) and Ki-67 (0.11).

Conclusions: We present the first clinicopathology-based ML framework for a Turkish breast cancer cohort. This calibrated LR model provides clinically meaningful pCR discrimination, potentially aiding clinical decision-making in the neoadjuvant setting.
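The abstract mentions a weighted Youden's index for choosing a sensitivity-oriented decision threshold but does not give the formula or the weight. As a minimal sketch, assuming the common form J_w = 2 * (w * Se + (1 - w) * Sp) - 1, the code below picks the cutoff that maximizes J_w over calibrated probabilities. The function name, the weight w = 0.7, and the toy data are all hypothetical and chosen only for illustration.

```python
def weighted_youden_threshold(y_true, y_prob, w=0.7):
    """Pick the cutoff maximizing J_w = 2 * (w * Se + (1 - w) * Sp) - 1.

    w > 0.5 favors sensitivity; w = 0.5 recovers the ordinary Youden index.
    A case is predicted positive when its probability is >= the cutoff.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_t, best_j = None, float("-inf")
    # Candidate cutoffs are the observed probabilities, lowest first.
    for t in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < t)
        se, sp = tp / n_pos, tn / n_neg
        j = 2 * (w * se + (1 - w) * sp) - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Toy labels and calibrated probabilities, invented for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]
y_prob = [0.90, 0.80, 0.60, 0.50, 0.40, 0.35, 0.30, 0.20, 0.10, 0.05]

t_plain, j_plain = weighted_youden_threshold(y_true, y_prob, w=0.5)
t_sens, j_sens = weighted_youden_threshold(y_true, y_prob, w=0.7)

print("unweighted cutoff:", t_plain)
print("sensitivity-weighted cutoff:", t_sens)
```

On this toy data the unweighted index selects a cutoff of 0.60, which misses one of the four positives, while the weight of 0.7 lowers the cutoff to 0.30 and captures all four at the cost of more false positives. This mirrors the trade-off in the reported results, where high sensitivity and NPV come with moderate specificity and PPV.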
