Dear Editor,

We appreciate the authors’ work in the recent article “Artificial Intelligence for Predicting Post-Excision Recurrence and Malignant Progression in Oral Potentially Malignant Disorders: A retrospective cohort study.” This study developed a multi-task artificial intelligence model that simultaneously predicts the risks of treatment failure, malignant progression, and lesion recurrence after surgical resection. Building upon this foundation, we offer the following insights for discussion. In accordance with the TITAN 2025 guidelines, we ensured transparency in the preparation of this article [1].

Comment 1: issues with algorithmic bias and generalization ability

Although this study demonstrates that the TabPFN-based multi-task AI model performs well in predicting treatment failure, malignant progression, and lesion recurrence in patients with oral potentially malignant disorders (OPMD), subgroup analysis reveals significant algorithmic bias [2]. The model performs better in male patients (AUC: 0.882–1.00) than in female patients (AUC: 0.778–0.852), and better in cases without dysplasia than in those with dysplasia. This discrepancy highlights a key limitation: the model’s reliance on demographic and histopathologic features may inadvertently amplify preexisting biases in clinical datasets. For instance, the weaker performance in high-risk subgroups (e.g., female patients with lichenoid features) may stem from their underrepresentation in imbalanced training data or from insufficiently captured confounding variables. If the model is deployed without rigorous validation across diverse populations, such biases could worsen health care disparities. Furthermore, the external validation cohort is small (n = 54) and derives from the same tertiary center as the training data, which raises doubts about the model’s generalization ability.
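To illustrate why subgroup AUCs estimated from small cohorts should be read with caution, the following is a minimal sketch that computes a bootstrap confidence interval for the AUC. It uses synthetic data only; the function names (`auc_score`, `bootstrap_auc_ci`, `simulate_cohort`) and all parameters are our own illustrative assumptions and do not reproduce the authors' analysis or data.

```python
import numpy as np


def auc_score(y_true, y_score):
    """AUC via the Mann-Whitney statistic; tied scores get half credit."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")  # AUC is undefined when one class is absent
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))


def bootstrap_auc_ci(y_true, y_score, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample patients with replacement
        value = auc_score(y_true[idx], y_score[idx])
        if not np.isnan(value):  # skip resamples that contain a single class
            stats.append(value)
    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)


def simulate_cohort(n, seed):
    """Hypothetical synthetic cohort: binary outcome plus a noisy risk score."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    score = rng.normal(loc=1.5 * y, scale=1.0)
    return y, score


if __name__ == "__main__":
    # Subgroup sizes are illustrative assumptions, not figures from the study.
    for label, n in [("small subgroup", 25), ("large subgroup", 500)]:
        y, s = simulate_cohort(n, seed=1)
        lower, upper = bootstrap_auc_ci(y, s)
        print(f"{label} (n={n}): AUC={auc_score(y, s):.3f}, "
              f"95% CI=({lower:.3f}, {upper:.3f})")
```

Reporting such intervals alongside each subgroup AUC would help readers judge whether an apparent sex- or dysplasia-related performance gap exceeds what sampling variability alone could produce.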
Given global variations in surgical practices, diagnostic criteria, and genetic susceptibility in OPMD management, the model’s dependence on features such as “margin status” and “dysplasia grade” – both prone to inter-observer variability – may restrict its applicability in resource-limited health care settings. Future research should prioritize multicenter, cross-regional validation and establish standardized protocols for feature collection. While decision curve analysis shows the model’s net benefit over the WHO classification and binary dysplasia grading systems, it does not fully establish clinical utility. For example, the model shows only a modest advantage in recurrence prediction, and false-positive cases (e.g., female patients with isolated tongue lesions) may lead to unnecessary interventions. Integrating cost-effectiveness analysis and patient-reported outcomes would strengthen the model’s value for clinical translation.

Comment 2: limitations in recurrence prediction and deficiencies in feature engineering

The study’s focus on multidimensional data is commendable, yet the model’s lower performance in predicting lesion recurrence (AUC: 0.791 vs. 0.912 for malignant progression) points to gaps in feature selection. The authors note that the factors influencing recurrence differ from those influencing malignant transformation, but the incorporated features (e.g., dysplasia grade, margin status) are primarily histopathology-based and may fail to capture subtle variations at the surgical or microenvironmental level [3]. For example, surgical technique (laser vs. conventional scalpel) and wound healing processes affect recurrence risk, but these factors are not sufficiently reflected in the model.
Misclassification analysis further indicates that false negatives mostly involve buccal/palatal lesions without dysplasia, suggesting that anatomical location and epithelial integrity have been overlooked. In addition, the study does not incorporate molecular biomarkers (e.g., TP53 mutations, loss of heterozygosity) or immune microenvironment data – both of which have been proven crucial in OPMD progression research. While SHAP analysis identifies “dysplasia grade” and “margin status” as key features, the over-reliance on traditional parameters may limit the model’s novelty compared to emerging AI methods that integrate genomic or microbiome data. For instance, recent studies have suggested a correlation between Fusobacterium nucleatum enrichment and OPMD recurrence, yet such features are absent from this model. The authors’ proposal to “identify novel predictive features” is reasonable but requires concrete implementation. Future iterations could incorporate intraoperative imaging data, digital pathology metrics, or serum biomarkers to improve recurrence prediction. Moreover, applying recurrent neural networks to time-series analysis of longitudinal data (e.g., repeated biopsies) may better capture dynamic recurrence risks.

Comment 3: clinical translation and ethical considerations of AI-driven monitoring

The study positions the AI model as a tool to identify OPMD patients requiring “close monitoring,” but its integration into clinical workflows and the associated ethical implications demand careful evaluation [4]. Although the model’s net benefit surpasses that of existing grading systems, its real-world effectiveness depends on seamless integration into electronic health records and surgical decision-making pathways. For example, how will a “high-risk” prediction trigger clinical actions? Will it necessitate more frequent biopsies or adjuvant therapies?
The authors do not discuss potential harms, such as over-monitoring of false-positive cases or liability issues arising from the model’s failure to detect rapidly progressing cases. Furthermore, while the “interpretability” enabled by SHAP plots is a strength, it may not be sufficient to earn clinicians’ full trust. For example, the finding that “elderly patients with tongue lesions” drive high-risk predictions aligns with clinical intuition but may give rise to confirmation bias. Prospective studies should assess whether AI-generated risk alerts alter surgeons’ behavior compared to standard protocols.

Finally, the ethical dimensions of data usage merit emphasis. The study relies on retrospective data collected without informed consent, which is justified by the de-identification process. However, as AI models evolve toward personalized risk scoring, transparency in data sources and protection of patient autonomy must be ensured. Future frameworks need to incorporate dynamic informed consent models and address the equity of AI-driven care.