Objective: To develop and validate machine learning models to predict 5-year survival in oral cancer using a large population-based registry and to rank prognostic factors. Methods: We analyzed Surveillance, Epidemiology, and End Results (SEER) data from 1992 to 2020. After applying inclusion criteria, 39 904 patients with complete data on 21 variables were included from 53 611 cases (25.6% exclusion). Selection bias was assessed by comparing included and excluded patients. The outcome was binary 5-year survival. Four models were trained and evaluated using nested fivefold cross-validation: XGBoost, LASSO, Random Forest, and logistic regression. Performance was assessed using the Brier score, area under the receiver operating characteristic curve (AUC), sensitivity, and specificity. Calibration was evaluated using slope and intercept with 95% confidence intervals. Model explainability used permutation feature importance and SHAP values. Results: Random Forest showed the best discrimination (AUC 77.6%, accuracy 71.4%, Brier 0.186) and was selected for risk stratification. However, it overestimated risk in lower deciles. Logistic regression and LASSO showed better calibration, with slopes near 1.0, but slightly lower discrimination (AUCs 75.5% and 76.9%). SHAP analysis identified localized stage as the strongest protective factor (importance 100.0), followed by age (91.1) and chemotherapy (29.3). Excluded patients had more unstaged tumors (3.9% vs 1.7%, P < .001). Conclusion: Random Forest provides strong risk stratification, but miscalibration limits its use for absolute risk prediction. Logistic regression and LASSO may be preferable when accurate probabilities are needed. External validation is required before clinical use.
Okeagu et al. studied this question.
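The abstract reports a Brier score and a calibration slope and intercept, but it does not say how they were computed. The following is a minimal sketch of one common way to obtain them, assuming the slope and intercept come from a logistic regression of the observed outcome on the logit of the predicted probability. The function names and the toy data are hypothetical and are not taken from the SEER analysis.

```python
import math


def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def _logit(p, eps=1e-12):
    # Clip to avoid log(0) for predictions of exactly 0 or 1.
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def calibration_slope_intercept(y_true, p_pred, n_iter=50, tol=1e-10):
    """Fit logit(P(y=1)) = a + b * logit(p_pred) by Newton-Raphson.

    Returns (slope b, intercept a). Assumption: both are estimated jointly;
    some papers instead estimate the intercept with the slope fixed at 1.
    """
    x = [_logit(p) for p in p_pred]
    a, b = 0.0, 1.0  # start at perfect calibration
    for _ in range(n_iter):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for xi, yi in zip(x, y_true):
            mu = _sigmoid(a + b * xi)
            w = mu * (1.0 - mu)
            g_a += yi - mu            # gradient w.r.t. intercept
            g_b += (yi - mu) * xi     # gradient w.r.t. slope
            h_aa += w                 # entries of X^T W X
            h_ab += w * xi
            h_bb += w * xi * xi
        det = h_aa * h_bb - h_ab * h_ab
        if det == 0.0:
            break
        da = (h_bb * g_a - h_ab * g_b) / det
        db = (h_aa * g_b - h_ab * g_a) / det
        a += da
        b += db
        if abs(da) + abs(db) < tol:
            break
    return b, a


def toy_group(p, n, events):
    """Hypothetical helper: n patients with prediction p, of whom `events` have y=1."""
    return [1] * events + [0] * (n - events), [p] * n


# Toy data (invented): observed event rates match the predictions exactly.
y, p = [], []
for pred, n, events in [(0.2, 10, 2), (0.5, 10, 5), (0.8, 10, 8)]:
    ys, ps = toy_group(pred, n, events)
    y += ys
    p += ps

brier = brier_score(y, p)
slope, intercept = calibration_slope_intercept(y, p)

# Toy data (invented): predictions of 0.1/0.9 where true rates are 0.2/0.8.
y2, p2 = [], []
for pred, n, events in [(0.1, 10, 2), (0.9, 10, 8)]:
    ys, ps = toy_group(pred, n, events)
    y2 += ys
    p2 += ps

over_slope, over_intercept = calibration_slope_intercept(y2, p2)
print(brier, slope, intercept, over_slope)
```

In the first toy set the predictions are perfectly calibrated, so the fitted slope stays at 1 and the intercept at 0. In the second, the predictions are more extreme than the observed rates, which pulls the slope below 1. That is the sense in which the abstract's "slopes near 1.0" indicates good calibration for logistic regression and LASSO.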