What is the clinical evidence from this study?

Study design: case-control. Population: patients with lung cancer and matched controls without any cancer diagnosis. Intervention: machine learning model (soft-voting ensemble). Primary outcome: model performance on the test set (ROC AUC 0.8715).

What question did this study set out to answer?

The aim was to develop a machine learning model using routine clinical data to identify high-risk individuals for lung cancer.

May 30, 2026

Large-scale EMR-based machine learning for early risk stratification of lung cancer.

What does this research mean for the field?

A machine learning ensemble model utilizing routine, nonspecific electronic health record data can accurately stratify lung cancer risk up to three years prior to diagnosis, achieving a ROC AUC of 0.8715. Novelty: methodological. Consensus alignment: neutral.

Key Result

A machine learning ensemble model using routine EHR data accurately identified individuals at high risk of lung cancer up to three years before diagnosis, achieving a ROC AUC of 0.8715.

Key Points

The aim was to develop a machine learning model using routine clinical data to identify high-risk individuals for lung cancer.
Analyzed 61 million electronic health record entries from Russian healthcare facilities.
Matched lung cancer patients with controls, excluding data around diagnosis to minimize leakage.
Trained models using ensemble techniques with class rebalancing.
Achieved 88.02% accuracy and 78.8% sensitivity on the test set.
Validation cohort showed 86.02% accuracy and 79.1% sensitivity.
Key features included age, healthcare utilization patterns, and vital signs.
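
The leakage rule in the Key Points can be made concrete with a short sketch. The abstract gives the windows (0.5 to 3 years before diagnosis as positive, 3 years or more as negative, the last 6 months and everything after diagnosis excluded). The function name `label_record`, the 365.25-day year, and the handling of exact boundaries are assumptions made for illustration, not details reported by the study.

```python
from datetime import date
from typing import Optional

DAYS_PER_YEAR = 365.25  # assumption: average year length for converting days to years


def label_record(record_date: date, diagnosis_date: Optional[date]) -> Optional[int]:
    """Return 1 (positive), 0 (negative), or None (excluded) for one EHR entry.

    Hypothetical helper: controls are passed with diagnosis_date=None, and the
    ">3 years of observation" requirement for controls is assumed to have been
    checked before this function is called.
    """
    if diagnosis_date is None:
        return 0  # cancer-free control
    years_before = (diagnosis_date - record_date).days / DAYS_PER_YEAR
    if years_before < 0.5:
        return None  # last 6 months before diagnosis, or post-diagnosis: excluded
    if years_before < 3.0:
        return 1  # covers the 0.5-1, 1-2 and 2-3 year groups
    return 0  # 3 or more years before diagnosis


# Hypothetical patient diagnosed on 1 January 2021
dx = date(2021, 1, 1)
print(label_record(date(2020, 1, 1), dx))   # about 1 year before diagnosis
print(label_record(date(2020, 10, 1), dx))  # about 3 months before diagnosis
print(label_record(date(2016, 1, 1), dx))   # about 5 years before diagnosis
```

Dropping the excluded entries before any feature is computed keeps records generated by the diagnostic work-up itself out of the training data.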

Study Design

Type

Case-Control

Multicenter

Yes

Structured PICO

Can a machine learning model using routine EHR data accurately identify individuals at high risk of lung cancer up to three years before diagnosis?

Population

Patients with confirmed lung cancer (ICD-10 C34) matched with controls without any cancer diagnosis from Russian healthcare facilities.

Intervention

Machine learning model (soft-voting ensemble of histogram-based gradient boosting, LightGBM, random forest) using routine clinical and demographic data from EHRs.
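
For readers unfamiliar with soft voting, the following is a minimal sketch of the idea: each base model outputs a probability of the positive class, and the ensemble averages those probabilities before applying a decision threshold. The equal weights, the 0.5 threshold, and the example scores are assumptions for illustration; the study does not report its weights or operating threshold.

```python
from typing import List, Optional, Sequence


def soft_vote(model_probs: Sequence[Sequence[float]],
              weights: Optional[Sequence[float]] = None) -> List[float]:
    """Average per-sample positive-class probabilities across models."""
    if not model_probs:
        raise ValueError("need at least one model")
    n_samples = len(model_probs[0])
    if any(len(p) != n_samples for p in model_probs):
        raise ValueError("all models must score the same samples")
    if weights is None:
        weights = [1.0] * len(model_probs)  # assumption: equal weights
    total = float(sum(weights))
    return [
        sum(w * p[i] for w, p in zip(weights, model_probs)) / total
        for i in range(n_samples)
    ]


def classify(probs: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Flag a sample as high risk when its averaged probability reaches the threshold."""
    return [int(p >= threshold) for p in probs]


# Hypothetical scores for three patients from the three base learners
hgb_scores = [0.9, 0.2, 0.6]   # histogram-based gradient boosting
lgbm_scores = [0.8, 0.4, 0.3]  # LightGBM
rf_scores = [0.7, 0.3, 0.3]    # random forest

ensemble_scores = soft_vote([hgb_scores, lgbm_scores, rf_scores])
risk_flags = classify(ensemble_scores)
print(ensemble_scores, risk_flags)
```

In practice this averaging is usually delegated to a library; scikit-learn's `VotingClassifier` with `voting="soft"` is one common option, though the abstract does not say which implementation was used.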

Outcome

Predictive performance (accuracy, sensitivity, specificity, NPV, F1-score, ROC AUC) for identifying high risk of lung cancer up to 3 years before diagnosis.
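
As a reminder of how the listed metrics are defined, here is a minimal sketch that derives them from confusion-matrix counts and computes ROC AUC as the probability that a randomly chosen case scores higher than a randomly chosen control. The counts and scores below are hypothetical and are not taken from the study; the helper names are illustrative, and the code assumes every denominator is nonzero.

```python
from typing import Dict, Sequence


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # share of true cases that were flagged
    specificity = tn / (tn + fp)   # share of controls correctly not flagged
    ppv = tp / (tp + fp)           # precision
    npv = tn / (tn + fn)           # share of negative predictions that are correct
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "npv": npv,
        "f1": f1,
    }


def roc_auc(case_scores: Sequence[float], control_scores: Sequence[float]) -> float:
    """Pairwise (Mann-Whitney) estimate of ROC AUC; ties count as half a win."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical counts: 100 cases and 900 controls
example = binary_metrics(tp=80, fp=200, tn=700, fn=20)
example_auc = roc_auc([0.9, 0.6], [0.7, 0.2, 0.1])
print(example, example_auc)
```

The pairwise AUC loop is quadratic in the number of samples, which is acceptable for illustration but not for a dataset of the size described here; a rank-based implementation would be the usual choice at scale.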

A machine learning model using routine EHR data can accurately stratify lung cancer risk up to three years prior to diagnosis, potentially facilitating early screening.

Main Result

Effect estimate: ROC AUC 0.8715

Abstract

e20015

Background: Lung cancer is the leading cause of cancer-related mortality worldwide. Early detection is crucial for improving outcomes; however, existing screening tools often rely on specialized data (e.g., CT imaging and biomarkers), limiting their broad implementation. We aimed to develop a machine learning model using routine clinical and demographic data from electronic health records (EHRs) to identify individuals at high risk of lung cancer up to three years before diagnosis.

Methods: We analyzed 61 million EHR entries from Russian healthcare facilities, including diagnoses, vital signs, lab results, physician visits, and demographic data. Patients with confirmed lung cancer (ICD-10 C34) were matched with controls without any cancer diagnosis. Patients were assigned to time-to-diagnosis groups (0.5–1, 1–2, and 2–3 years before diagnosis as positive; ≥3 years before diagnosis and cancer-free controls with >3 years of observation as negative). Data from the 6 months prior to diagnosis and from the post-diagnosis period were excluded to reduce diagnostic work-up leakage. Models (histogram-based gradient boosting, LightGBM, random forest) were trained with class rebalancing (SMOTE) and combined via a soft-voting ensemble. The dataset was split 80:20 at the patient level to prevent leakage, with additional validation on a cohort that included other malignancies (the population-representative validation cohort).

Results: On the test set, the ensemble achieved an accuracy of 88.02%, sensitivity of 78.8%, specificity of 78.43%, NPV of 98.2%, F1-score of 78.61%, and a ROC AUC of 0.8715. On the population-representative validation cohort, performance metrics were: accuracy 86.02%, sensitivity 79.1%, specificity 76.41%, NPV 95.2%, F1-score 77.73%, and ROC AUC 0.8335. The final model included 53 features, with the key contributors being age, patterns of healthcare utilization (e.g., primary care and surgical visits), vital signs (respiratory rate, pulse, blood pressure), routine laboratory parameters, and utilization of respiratory-related investigations (e.g., fluorography/CT chest), rather than cancer-specific markers.

Conclusions: We have developed a highly accurate artificial intelligence model for the early stratification of lung cancer risk utilizing routine, nonspecific EHR data. This model obviates the need for additional testing or specialized biomarkers, thereby rendering it scalable and cost-effective for screening at the population level. Its integration within primary care EHR systems could facilitate proactive referrals for low-dose computed tomography (CT) screening for individuals at high risk, potentially enhancing early detection rates and improving clinical outcomes.
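
The 80:20 patient-level split mentioned in the Methods can be illustrated with a minimal sketch: patients, not individual EHR entries, are assigned to the training or test set, so no patient contributes records to both. The record layout (a dict with a `patient_id` key), the function name, and the fixed seed are assumptions for illustration and are not described in the abstract.

```python
import random
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, object]


def split_by_patient(records: Sequence[Record],
                     test_fraction: float = 0.2,
                     seed: int = 0) -> Tuple[List[Record], List[Record]]:
    """Split records so that all entries of a given patient land on one side."""
    # Sort before shuffling so the result depends only on the seed
    patient_ids = sorted({str(r["patient_id"]) for r in records})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)
    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_ids = set(patient_ids[:n_test])
    train = [r for r in records if str(r["patient_id"]) not in test_ids]
    test = [r for r in records if str(r["patient_id"]) in test_ids]
    return train, test


# Hypothetical data: 10 patients with 3 EHR entries each
demo_records = [
    {"patient_id": f"P{p:02d}", "entry": e} for p in range(10) for e in range(3)
]
train_set, test_set = split_by_patient(demo_records)
train_patients = {r["patient_id"] for r in train_set}
test_patients = {r["patient_id"] for r in test_set}
print(len(train_patients), len(test_patients))
```

Splitting at the entry level instead would let the model see other visits from the same patient during training, which tends to inflate test performance.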
