Can machine learning models using non-invasive pre-colonoscopy features accurately predict the presence of high-risk colorectal polyps?
6,243 patients undergoing colonoscopy (4,681 in internal validation cohort from 2014-2022; 1,562 in external validation cohort from 2023-2024).
Machine learning models (neural networks, random forest, SVM, Naive Bayes, logistic regression, decision trees, KNN, and XGBoost) using pre-colonoscopy demographic, lifestyle, and comorbidity features.
Prediction of high-risk polyps (villous/tubulovillous adenoma, high-grade dysplasia, ≥10 mm in size, and/or ≥3 polyps per procedure).
Machine learning models can predict high-risk colorectal polyps using pre-colonoscopy features, but performance degrades in external validation, highlighting the need for multimodal data to achieve clinical utility.
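The outcome definition and the temporal cohort split summarized above can be written out as a short sketch. The record layout (`year`, `histology`, `high_grade_dysplasia`, `max_size_mm`, `polyp_count`) and the helper names are hypothetical, chosen only for illustration; the source does not describe its data schema, and the example records are made up.

```python
from dataclasses import dataclass


# Hypothetical record layout; field names are assumptions, not the study's schema.
@dataclass
class Procedure:
    year: int                    # calendar year of the colonoscopy
    histology: str               # e.g. "tubular", "villous", "tubulovillous"
    high_grade_dysplasia: bool
    max_size_mm: float           # size of the largest polyp, in millimetres
    polyp_count: int             # number of polyps found in this procedure


def is_high_risk(p: Procedure) -> bool:
    """Return True if any one of the high-risk criteria is met (and/or rule)."""
    return (
        p.histology in {"villous", "tubulovillous"}
        or p.high_grade_dysplasia
        or p.max_size_mm >= 10
        or p.polyp_count >= 3
    )


def cohort(p: Procedure) -> str:
    """Temporal split: 2014-2022 is internal, 2023-2024 is external."""
    if 2014 <= p.year <= 2022:
        return "internal"
    if 2023 <= p.year <= 2024:
        return "external"
    return "excluded"


# Made-up example records, not study data.
examples = [
    Procedure(2016, "tubular", False, 4.0, 1),
    Procedure(2019, "tubulovillous", False, 6.0, 1),
    Procedure(2023, "tubular", False, 12.0, 2),
    Procedure(2024, "tubular", False, 5.0, 3),
]
labels = [("HRP" if is_high_risk(p) else "LRP", cohort(p)) for p in examples]
```

Because the split is by calendar period rather than at random, the external cohort tests the models against later patients, which is where shifts in practice or demographics would show up.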
Abstract

Background: Risk stratification for advanced colorectal polyps typically relies on colonoscopy and/or pathology findings, but there is interest in whether non-invasive features available before colonoscopy can identify patients at higher risk. Such a tool could support clinical decision-making by reserving colonoscopy surveillance for those most likely to have high-risk polyps and avoiding unnecessary procedures in those at lower risk.

Methods: We developed machine learning models to predict high-risk polyps from demographic, lifestyle, and comorbidity features. Patients with villous/tubulovillous adenoma, high-grade dysplasia, polyps ≥10 mm in size, and/or ≥3 polyps per procedure were classified as having high-risk polyps (HRP); all others were classified as having low-risk polyps (LRP). The data set consisted of 4,681 patients from 2014-2022 (internal validation; 2,018 HRP, 2,658 LRP) and 1,562 patients from 2023-2024 (external validation; 769 HRP, 793 LRP). The models evaluated were neural networks, random forest, SVM, Naive Bayes, logistic regression, decision trees, KNN, and XGBoost.

Results: The neural network achieved the best internal performance (ROC-AUC 0.7764, PR-AUC 0.75, accuracy 0.72). However, its performance declined in the external cohort (ROC-AUC 0.67, accuracy 0.66), suggesting overfitting or feature drift. Less complex models such as Naive Bayes, SVM, and XGBoost, while weaker internally (ROC-AUC 0.54-0.59), were more stable in the external cohort (ROC-AUC 0.52-0.63, accuracy ∼0.53-0.60). These results suggest that the predictive signal in pre-colonoscopy features exists but is moderate and highly sensitive to temporal and cohort variation. Model interpretability analysis using SHAP values revealed that the main variables driving predictions were age, smoking status, sex, occupation, race, and indication for colonoscopy. Additional contributors included family history of colorectal cancer in first-degree relatives, BMI, and several clinical/lifestyle factors such as ASA use, NSAID use, and alcohol use. These results indicate that while traditional clinical risk factors dominate prediction, sociodemographic variables also carry important signal.

Conclusions: HRP prediction based on non-invasive pre-colonoscopy features is feasible but challenging. The performance degradation on external validation underscores the importance of real-world generalizability and the effects of changes in practice or demographics. These findings highlight both the potential clinical utility and the limitations of pre-colonoscopy risk prediction, and suggest that multimodal data sources (e.g., genomics, microbiomics, imaging, social determinants) may be required to achieve clinically meaningful performance.

Citation Format: Basheer Qolomany, Mrinalini Deverapall, Adeyinka O. Laiyemo, Zaki A. Sherif, Hassan Brim, Hassan Ashktorab. Predicting high-risk colorectal polyps using pre-colonoscopy features: Machine learning model development and validation [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2026; Part 1 (Regular Abstracts); 2026 Apr 17-22; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2026;86(7 Suppl):Abstract nr 4220.
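For readers less familiar with the headline metric, ROC-AUC can be read as the probability that a randomly chosen HRP patient receives a higher predicted score than a randomly chosen LRP patient, with 0.5 corresponding to chance. The sketch below computes it by direct pairwise comparison in plain Python. The helper `roc_auc` and the toy scores are illustrative assumptions, not the authors' code or data; only the two reported neural-network ROC-AUC values are taken from the abstract.

```python
def roc_auc(labels, scores):
    """Pairwise (Mann-Whitney) ROC-AUC: P(score_pos > score_neg), ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (made-up scores): 1 = HRP, 0 = LRP.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)  # 3 of 4 positive/negative pairs ranked correctly

# Internal-to-external drop for the neural network, using the reported values.
internal_auc, external_auc = 0.7764, 0.67
auc_drop = round(internal_auc - external_auc, 4)
```

The pairwise form is quadratic in cohort size and is shown only for clarity; a rank-based implementation gives the same value more efficiently on cohorts of this size.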
Basheer Qolomany
Mrinalini Deverapall
Adeyinka Laiyemo
Cancer Research
Howard University
Qolomany et al. studied this question.
www.synapsesocial.com/papers/69d1fd9ca79560c99a0a3bb5 — DOI: https://doi.org/10.1158/1538-7445.am2026-4220