What question did this study set out to answer?

May 30, 2026

Machine learning model for lung cancer risk stratification using routine complete blood count exams.

Key Points

This research aims to create a machine learning model for lung cancer risk stratification using routine complete blood count tests.
Retrospective analysis of CBC tests from 53,093 individuals aged 50 and older
Used high-risk CT findings (n=1,178) for model training and biopsy-confirmed cases (n=141) for final evaluation
A ridge regression model was trained using selected CBC features
The model achieved an AUC of 0.71 (95% CI: 0.70–0.71) in the overall population
In the smoker subgroup, performance was comparable, with an AUC of 0.68 (95% CI: 0.67–0.68)
CBC parameters such as neutrophil count and RDW showed significant differences between high-risk cases and low-risk controls (p < 0.001)

Abstract

e20005

Background: Lung cancer remains the leading cause of cancer-related mortality worldwide, largely due to late-stage diagnosis. Although low-dose computed tomography (LDCT) enables early detection, its widespread implementation is limited by cost, resource availability, and access disparities. This retrospective study aimed to develop a machine learning model using complete blood count (CBC) tests as a low-cost tool for lung cancer risk stratification.

Methods: We analyzed CBC tests from 53,093 individuals (30,313 females, 57.10%; 22,780 males, 42.90%) 50 years and older who underwent chest CT or biopsy within six months of blood testing in Grupo Fleury laboratory, Brazil. The study population was retrospectively assembled from real-world clinical data. Low-risk CT findings were used as controls (36,243 for training and 15,535 for validation). High-risk CT findings (n = 1,178), identified from radiology reports describing features highly suggestive of lung cancer and corresponding to an estimated malignancy probability ≥85%, were used exclusively as cases for model training, while biopsy-confirmed lung cancer cases (n = 141) were reserved as the only positive cases in the independent test set for final model evaluation. A ridge regression model was trained using selected CBC-derived features. Model performance was additionally evaluated in a predefined subgroup of 1,267 individuals with documented smoking status to assess performance in a high-risk population.

Results: Several CBC parameters showed significant differences between high-risk CT cases and low-risk controls, including neutrophil count and RDW (p < 0.001 for both). Following feature selection, MCV, neutrophil count, and RDW were retained in the final model, which achieved an AUC of 0.71 (95% CI: 0.70–0.71). Model discrimination remained stable across bootstrap resampling. In the subgroup analysis restricted to smokers, model performance remained comparable to that observed in the overall population with an AUC of 0.68 (95% CI: 0.67–0.68), indicating consistent discrimination in this high-risk group.

Conclusions: A machine learning model based on routinely available CBC parameters demonstrated potential as a scalable and low-cost lung cancer risk stratification tool. This approach may help prioritize individuals for CT-based screening, particularly in settings with limited access to LDCT or when smoking history is unavailable. External validation is required to confirm generalizability and clinical utility.
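
The evaluation pipeline described in the abstract (a ridge model on three CBC features, scored by AUC with a bootstrap confidence interval) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data are synthetic, the feature distributions and effect sizes are invented, the ridge model is a closed-form linear fit to 0/1 labels, and the interval is a percentile bootstrap. The abstract does not state which ridge variant, penalty strength, or bootstrap scheme the authors used, and none of the function names come from the study.

```python
import numpy as np


def make_synthetic(n_controls, n_cases, rng):
    """Synthetic CBC-like features; every distribution here is invented."""
    y = np.r_[np.zeros(n_controls), np.ones(n_cases)]
    n = y.size
    mcv = rng.normal(90.0, 5.0, n)                    # no shift in cases
    neutrophils = rng.normal(4.0, 1.2, n) + 0.8 * y   # shifted up in cases
    rdw = rng.normal(13.5, 1.0, n) + 0.6 * y          # shifted up in cases
    return np.column_stack([mcv, neutrophils, rdw]), y


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression on standardized features (assumed variant)."""
    mean, std = X.mean(axis=0), X.std(axis=0)
    Z = (X - mean) / std
    # Solve (Z'Z + alpha * I) w = Z'(y - mean(y)); the intercept is not penalized.
    w = np.linalg.solve(Z.T @ Z + alpha * np.eye(Z.shape[1]), Z.T @ (y - y.mean()))
    return {"mean": mean, "std": std, "w": w, "b": y.mean()}


def predict_score(model, X):
    """Continuous risk score; only its ranking matters for the AUC."""
    return ((X - model["mean"]) / model["std"]) @ model["w"] + model["b"]


def auc(y, scores):
    """AUC from the Mann-Whitney U statistic, using average ranks for ties."""
    y = np.asarray(y).ravel()
    scores = np.asarray(scores, dtype=float).ravel()
    _, inverse, counts = np.unique(scores, return_inverse=True, return_counts=True)
    avg_rank = np.cumsum(counts) - (counts - 1) / 2.0  # 1-based rank per tie group
    ranks = avg_rank[inverse]
    pos = y == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return float((ranks[pos].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg))


def bootstrap_auc_ci(y, scores, n_boot=300, seed=0):
    """Percentile bootstrap 95% interval for the AUC (assumed scheme)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        if y[idx].min() == y[idx].max():
            continue  # a resample needs both classes to define an AUC
        stats.append(auc(y[idx], scores[idx]))
    lo, hi = np.percentile(stats, [2.5, 97.5])
    return float(lo), float(hi)


def run_demo(seed=0):
    """Train on one synthetic cohort, evaluate on a separate one."""
    rng = np.random.default_rng(seed)
    X_train, y_train = make_synthetic(3000, 100, rng)
    X_test, y_test = make_synthetic(2000, 100, rng)
    model = fit_ridge(X_train, y_train, alpha=1.0)
    scores = predict_score(model, X_test)
    return (auc(y_test, scores), *bootstrap_auc_ci(y_test, scores, seed=seed))


if __name__ == "__main__":
    point, lo, hi = run_demo()
    print(f"AUC {point:.2f} (95% CI: {lo:.2f}-{hi:.2f})")
```

Because the AUC depends only on how the scores rank cases against controls, a linear ridge fit to binary labels is enough for this sketch; a penalized logistic model would be an equally plausible reading of "ridge regression" in the abstract. The numbers printed by this example reflect the invented synthetic data and say nothing about the study's results.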
