BACKGROUND: Lung cancer remains one of the leading causes of cancer-related mortality worldwide. Current diagnostic strategies rely primarily on imaging examinations and histopathological biopsy, which are invasive and poorly suited to longitudinal monitoring. Non-invasive, efficient predictive models are therefore urgently needed to support early detection and disease staging. This study aimed to compare multiple machine learning algorithms and identify the optimal model for distinguishing benign pulmonary nodules from early- and advanced-stage lung cancer.
METHODS: Between January 2024 and September 2025, a total of 1,238 patients with pulmonary nodules were registered at Nanjing Drum Tower Hospital, comprising 951 patients with lung cancer and 287 patients with benign pulmonary nodules. In addition, 250 healthy individuals were enrolled as a control group. Clinical characteristics, including age, sex, and routine laboratory indicators, were collected for model development. Five machine learning algorithms were trained and evaluated. Model performance was assessed using the area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, and precision. Model interpretability was further explored using SHapley Additive exPlanations (SHAP).
RESULTS: Among the evaluated algorithms, the eXtreme Gradient Boosting (XGBoost) model demonstrated the best overall performance and was selected as the final predictive model. Correlation and differential analyses identified informative features for each of six pairwise comparisons, with the following feature counts and corresponding AUC values: healthy vs. benign (60 features, AUC 0.999), healthy vs. early-stage lung cancer (37 features, AUC 0.962), healthy vs. advanced-stage lung cancer (25 features, AUC 0.970), benign vs. early-stage lung cancer (17 features, AUC 0.663), benign vs. advanced-stage lung cancer (12 features, AUC 0.926), and early- vs. advanced-stage lung cancer (17 features, AUC 0.713). SHAP analysis further elucidated the relative importance and directional effects of individual clinical features on model predictions.
CONCLUSION: The XGBoost-based models performed best in comparisons involving healthy controls (AUC 0.962 to 0.999) and in distinguishing benign pulmonary nodules from advanced-stage lung cancer (AUC 0.926), whereas discrimination was weaker for benign vs. early-stage lung cancer (AUC 0.663) and for early- vs. advanced-stage lung cancer (AUC 0.713). This approach may assist clinicians in early identification, risk stratification, and timely intervention for patients with pulmonary nodules, potentially contributing to reduced lung cancer-related mortality.
Wu et al. (Tue,) studied this question.
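The four performance measures named in the METHODS (AUC, accuracy, sensitivity, and precision) can be illustrated for a single binary comparison with a minimal sketch. This is not the authors' pipeline: the function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration, and the AUC is computed here by its pairwise-ranking definition rather than by any particular library.

```python
from typing import Dict, List, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos: List[float] = [s for t, s in zip(y_true, scores) if t == 1]
    neg: List[float] = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate(y_true: Sequence[int], scores: Sequence[float], threshold: float = 0.5) -> Dict[str, float]:
    """Compute AUC from raw scores and the other metrics from thresholded predictions."""
    y_pred = [1 if s >= threshold else 0 for s in scores]  # threshold is an assumption
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "auc": roc_auc(y_true, scores),
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,  # true positive rate
        "precision": tp / (tp + fp) if tp + fp else 0.0,    # positive predictive value
    }


# Toy data invented purely for illustration; these are not values from the study.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.45, 0.7, 0.9, 0.6]
metrics = evaluate(y_true, scores)
print(metrics)
```

AUC is computed from the raw scores and so does not depend on the threshold, whereas accuracy, sensitivity, and precision change with the cut-off chosen. This is one reason the abstract can report AUC as the headline figure for each comparison.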