A machine learning ensemble model using routine EHR data accurately identified individuals at high risk of lung cancer up to three years before diagnosis, achieving a ROC AUC of 0.8715.
Case-Control
Yes
Can a machine learning model using routine EHR data accurately identify individuals at high risk of lung cancer up to three years before diagnosis?
A machine learning model using routine EHR data can accurately stratify lung cancer risk up to three years prior to diagnosis, potentially facilitating early screening.
Effect estimate: ROC AUC 0.8715
e20015 Background: Lung cancer is the leading cause of cancer-related mortality worldwide. Early detection is crucial for improving outcomes; however, existing screening tools often rely on specialized data (e.g., CT imaging and biomarkers), limiting their broad implementation. We aimed to develop a machine learning model using routine clinical and demographic data from electronic health records (EHRs) to identify individuals at high risk of lung cancer up to three years before diagnosis.
Methods: We analyzed 61 million EHR entries from Russian healthcare facilities, including diagnoses, vital signs, lab results, physician visits, and demographic data. Patients with confirmed lung cancer (ICD-10 C34) were matched with controls without any cancer diagnosis. Patients were assigned to time-to-diagnosis groups (0.5–1, 1–2, and 2–3 years before diagnosis as positive; ≥3 years before diagnosis and cancer-free controls with >3 years of observation as negative). Data from the 6 months prior to diagnosis and from the post-diagnosis period were excluded to reduce leakage from the diagnostic work-up. Models (histogram-based gradient boosting, LightGBM, random forest) were trained with class rebalancing (SMOTE) and combined via a soft-voting ensemble. The dataset was split 80:20 at the patient level to prevent leakage, with additional validation on cohorts containing other malignancies (the population-representative validation cohort).
Results: On the test set, the ensemble achieved an accuracy of 88.02%, sensitivity of 78.8%, specificity of 78.43%, NPV of 98.2%, F1-score of 78.61%, and a ROC AUC of 0.8715. On the population-representative validation cohort, performance metrics were: accuracy 86.02%, sensitivity 79.1%, specificity 76.41%, NPV 95.2%, F1-score 77.73%, and ROC AUC 0.8335. The final model included 53 features, with the key contributors being age, patterns of healthcare utilization (e.g., primary care and surgical visits), vital signs (respiratory rate, pulse, blood pressure), routine laboratory parameters, and utilization of respiratory-related investigations (e.g., fluorography/CT chest), rather than cancer-specific markers.
Conclusions: We have developed a highly accurate artificial intelligence model for the early stratification of lung cancer risk using routine, nonspecific EHR data. The model requires no additional testing or specialized biomarkers, which makes it scalable and cost-effective for population-level screening. Its integration within primary care EHR systems could facilitate proactive referral of high-risk individuals for low-dose computed tomography (CT) screening, potentially enhancing early detection rates and improving clinical outcomes.
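The two leakage-sensitive steps named in the Methods, the patient-level 80:20 split and the soft-voting combination, can be illustrated with a minimal standard-library sketch. This is not the authors' code: the helper names `patient_level_split` and `soft_vote`, the toy patient IDs, the probabilities, and the 0.5 decision threshold are all assumptions made for illustration, and model training and SMOTE rebalancing are not shown.

```python
import random
from statistics import mean


def patient_level_split(patient_ids, test_fraction=0.2, seed=0):
    """Split unique patient IDs so that no patient appears in both sets.

    Hypothetical helper: splitting by patient rather than by record keeps
    all EHR entries of one person on the same side of the split.
    """
    unique_ids = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    n_test = max(1, round(len(unique_ids) * test_fraction))
    return set(unique_ids[n_test:]), set(unique_ids[:n_test])


def soft_vote(probabilities_per_model, threshold=0.5):
    """Average the positive-class probabilities of several models.

    Each inner list holds one model's probabilities for the same records.
    The threshold is an assumed value, not one reported in the abstract.
    """
    averaged = [mean(p) for p in zip(*probabilities_per_model)]
    labels = [int(p >= threshold) for p in averaged]
    return averaged, labels


# Toy data (invented): several EHR records per patient, ten patients.
record_patient_ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5",
                      "p5", "p6", "p7", "p8", "p9", "p10"]
train_ids, test_ids = patient_level_split(record_patient_ids)

# Toy positive-class probabilities from three models for four records.
model_probs = [
    [0.9, 0.2, 0.6, 0.4],  # stands in for histogram-based gradient boosting
    [0.8, 0.1, 0.7, 0.5],  # stands in for LightGBM
    [0.7, 0.3, 0.5, 0.3],  # stands in for random forest
]
averaged, labels = soft_vote(model_probs)
```

Records are then assigned to the training or test set according to which set their patient ID falls in, so a patient's earlier and later entries cannot end up on opposite sides of the split.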
Nazarova et al. conducted a case-control study in lung cancer, comparing patients with confirmed lung cancer against controls without any cancer diagnosis. The machine learning model (soft-voting ensemble) was evaluated by ROC AUC on the test set, where it reached 0.8715.
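For readers less familiar with the headline metric, ROC AUC can be read as the probability that a randomly chosen case receives a higher risk score than a randomly chosen control, with ties counted as one half. The sketch below computes it directly from that definition; the function name `roc_auc` and the toy labels and scores are invented for illustration and are not data from the study.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    Quadratic in the number of records, so intended for small examples only.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example (invented): three cases and four controls.
toy_labels = [1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.7, 0.4, 0.8, 0.3, 0.2, 0.1]
auc = roc_auc(toy_labels, toy_scores)  # 10 of 12 pairs ranked correctly
```

On this reading, the reported test-set value of 0.8715 means that a case outranks a control in roughly 87% of such pairs.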