What question did this study set out to answer?

This study aims to develop and validate machine learning models for predicting five-year survival in oral cancer patients using large population data.

May 27, 2026 · Open Access

A Machine Learning Approach to Prognostication in Oral Cancer: Analysis of the Surveillance, Epidemiology, and End Results Database

Key Points

The analysis used SEER data from 1992 to 2020 and included 39,904 patients with complete data on 21 variables.
Four models were trained and evaluated with nested fivefold cross-validation: XGBoost, LASSO, Random Forest, and logistic regression.
Performance was assessed using the Brier score, AUC, sensitivity, and specificity; explainability was evaluated with permutation feature importance and SHAP values.
The Random Forest model had the highest discrimination (AUC 77.6%, accuracy 71.4%).
Logistic regression and LASSO displayed better calibration but slightly lower discrimination (AUCs 75.5% and 76.9%).
SHAP analysis identified localized stage as the strongest protective factor; age and chemotherapy were the next most important predictors.
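
The performance measures named above can be illustrated with a minimal Python sketch. This is not the study's code: the function names and the toy data are hypothetical, the 0.5 classification threshold is an assumption, and the pairwise AUC shown here is only practical for small samples.

```python
from typing import Sequence, Tuple


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def auc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. Assumes both classes are present.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(
        1.0 if pp > pn else 0.5 if pp == pn else 0.0
        for pp in pos
        for pn in neg
    )
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(
    y_true: Sequence[int], y_prob: Sequence[float], threshold: float = 0.5
) -> Tuple[float, float]:
    """Sensitivity and specificity after thresholding the probabilities."""
    pairs = list(zip(y_true, y_prob))
    tp = sum(1 for y, p in pairs if y == 1 and p >= threshold)
    fn = sum(1 for y, p in pairs if y == 1 and p < threshold)
    tn = sum(1 for y, p in pairs if y == 0 and p < threshold)
    fp = sum(1 for y, p in pairs if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data, NOT taken from the study.
# 1 = the outcome class being predicted, 0 = the other class.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.6, 0.7, 0.2, 0.1]

print("Brier score:", round(brier_score(y_true, y_prob), 3))
print("AUC:", auc(y_true, y_prob))
print("Sensitivity, specificity:", sensitivity_specificity(y_true, y_prob))
```

A lower Brier score is better; a model that always predicts 0.5 scores 0.25. AUC measures ranking only, so a model can discriminate well while still producing poorly calibrated probabilities.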

Abstract

Objective: To develop and validate machine learning models to predict 5-year survival in oral cancer using a large population-based registry and to rank prognostic factors.

Methods: We analyzed Surveillance, Epidemiology, and End Results (SEER) data from 1992 to 2020. After applying inclusion criteria, 39,904 patients with complete data on 21 variables were included from 53,611 cases (25.6% exclusion). Selection bias was assessed by comparing included and excluded patients. The outcome was binary 5-year survival. Four models were trained and evaluated using nested fivefold cross-validation: XGBoost, LASSO, Random Forest, and logistic regression. Performance was assessed using the Brier score, area under the receiver operating characteristic curve (AUC), sensitivity, and specificity. Calibration was evaluated using slope and intercept with 95% confidence intervals. Model explainability used permutation feature importance and SHAP values.

Results: Random Forest showed the best discrimination (AUC 77.6%, accuracy 71.4%, Brier 0.186) and was selected for risk stratification. However, it overestimated risk in lower deciles. Logistic regression and LASSO showed better calibration, with slopes near 1.0, but slightly lower discrimination (AUCs 75.5% and 76.9%). SHAP analysis identified localized stage as the strongest protective factor (importance 100.0), followed by age (91.1) and chemotherapy (29.3). Excluded patients had more unstaged tumors (3.9% vs 1.7%, P < .001).

Conclusion: Random Forest provides strong risk stratification, but miscalibration limits its use for absolute risk prediction. Logistic regression and LASSO may be preferable when accurate probabilities are needed. External validation is required before clinical use.
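
To make the calibration slope and intercept concrete, here is a minimal Python sketch of one common definition: regress the observed outcome on the logit of the predicted probability with a two-parameter logistic model. The abstract does not describe the authors' exact procedure, so this is an assumption. The function names and toy data are hypothetical, and confidence intervals are omitted.

```python
import math
from typing import List, Sequence, Tuple


def logit(p: float, eps: float = 1e-6) -> float:
    """Log-odds of p, clipped away from 0 and 1 to keep the value finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def calibration_intercept_slope(
    y_true: Sequence[int], y_prob: Sequence[float],
    max_iter: int = 50, tol: float = 1e-10,
) -> Tuple[float, float]:
    """Fit logit(P(y = 1)) = a + b * logit(p) by Newton's method; return (a, b)."""
    x = [logit(p) for p in y_prob]
    a, b = 0.0, 1.0  # start from perfect calibration
    for _ in range(max_iter):
        mu = [sigmoid(a + b * xi) for xi in x]
        w = [m * (1.0 - m) for m in mu]
        # Gradient of the log-likelihood with respect to (a, b).
        g0 = sum(y - m for y, m in zip(y_true, mu))
        g1 = sum((y - m) * xi for y, m, xi in zip(y_true, mu, x))
        # Entries of the 2x2 information matrix X^T W X.
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < tol:
            break
    return a, b


def expand(groups: Sequence[Tuple[float, int, int]]) -> Tuple[List[int], List[float]]:
    """Turn (predicted probability, n patients, n events) groups into flat lists."""
    y: List[int] = []
    p: List[float] = []
    for prob, n, events in groups:
        y += [1] * events + [0] * (n - events)
        p += [prob] * n
    return y, p


# Hypothetical toy data, NOT from the study: predictions of 0.1, 0.5, and 0.9
# against observed event rates of 0.2, 0.5, and 0.8 (predictions too extreme).
y_obs, p_pred = expand([(0.1, 10, 2), (0.5, 10, 5), (0.9, 10, 8)])
intercept, slope = calibration_intercept_slope(y_obs, p_pred)
print("intercept:", round(intercept, 3), "slope:", round(slope, 3))
```

A slope near 1.0 with an intercept near 0 indicates well-calibrated probabilities. A slope below 1.0, as in the toy data, means the predictions are too extreme. Some analyses instead estimate the intercept with the slope fixed at 1 (calibration-in-the-large), which this sketch does not do.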
