Abstract
Rationale Smoking is the most important risk factor for the development of Chronic Obstructive Pulmonary Disease (COPD). Cases of COPD in patients with no history of smoking have been studied with Machine Learning (ML) models that predict disease risk from the interaction patterns of other relevant etiological factors. These models use clinical and demographic variables such as respiratory infection history, comorbidities, and environmental factors. They have demonstrated acceptable discrimination, with sensitivities between 60% and 75% and an Area Under the Curve (AUC) above 70%. We present the development and internal validation of a predictive model that reduces reliance on smoking habits for risk prediction by using readily available clinical and demographic variables (Table 1).
Methodology Retrospective study with a source population of 150,555 electronic medical records (EMRs) from the Colombian health insurance company MediSinú. The variables were extracted from the EMRs through natural language processing using the Arkangel AI application. The sample comprised 1,125 records of patients over 40 years of age with at least one spirometry test performed within the previous two years. The primary outcome was the prediction of COPD risk, with COPD defined as a spirometry with FEV1/FVC < 70%. A total of 439 ML models were trained and tested using different hyperparameters, imputation techniques, and population distributions. For model robustness, 7-fold cross-validation was used.
Results The final model was selected by prioritizing sensitivity first, then AUC and specificity. The chosen model, a K-Nearest Neighbors classifier, achieved a sensitivity of 85.3%, an AUC of 67.7%, and a specificity of 50.5%. SHAP values, which quantify the contribution of each variable to a model's predictions, revealed that the most influential variables were occupational exposure to smoke or chemicals, smoking habits, age, and chronic cough.
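As an illustration only, the outcome definition and the model selection rule described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' pipeline: it assumes that "prioritizing" means a lexicographic ordering (sensitivity, then AUC, then specificity), and the names `has_copd`, `Candidate`, `select_model`, and the two `hypothetical_*` candidates with their metrics are invented for the example. Only the first candidate's metrics come from the Results.

```python
from dataclasses import dataclass

# Outcome definition from the Methodology: FEV1/FVC below 70%.
COPD_RATIO_THRESHOLD = 0.70


def has_copd(fev1_l: float, fvc_l: float) -> bool:
    """Label a spirometry as COPD-positive when FEV1/FVC < 0.70."""
    if fvc_l <= 0:
        raise ValueError("FVC must be positive")
    return fev1_l / fvc_l < COPD_RATIO_THRESHOLD


def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: share of COPD cases the model flags."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """True negative rate: share of non-COPD cases the model clears."""
    return tn / (tn + fp)


@dataclass(frozen=True)
class Candidate:
    name: str  # hypothetical model identifier
    sensitivity: float
    auc: float
    specificity: float


def select_model(candidates):
    """Pick the best candidate by sensitivity, then AUC, then specificity.

    Assumption: the ranking is lexicographic; the abstract does not
    spell out how ties or trade-offs were handled.
    """
    return max(candidates, key=lambda c: (c.sensitivity, c.auc, c.specificity))


candidates = [
    # Metrics reported in the Results for the chosen model.
    Candidate("knn_selected", 0.853, 0.677, 0.505),
    # Invented candidates, for illustration only.
    Candidate("hypothetical_b", 0.853, 0.650, 0.560),
    Candidate("hypothetical_c", 0.790, 0.720, 0.640),
]

best = select_model(candidates)
print(best.name)
```

Under this ordering, a model with higher AUC and specificity but lower sensitivity (such as `hypothetical_c`) is never preferred, which reflects a screening-oriented choice to miss as few cases as possible at the cost of more false positives.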
Conclusion The ML model achieved metrics competitive with the current literature using 12 readily available variables, unlike other algorithms that rely on several specialized tests and genetic factors. According to the SHAP values, the most influential variables agree with the literature (Smoking Habit, Asthma, Age). This reaffirms the relevance of exposure to cigarette smoke. However, it also underlines the importance of incorporating new technologies into traditional approaches to explore new associations between conventional and non-conventional disease risk factors. It remains essential to develop predictive models for early COPD identification that do not depend solely on smoking habits. External validation of this model is necessary to assess its performance in different contexts.
This abstract is funded by: AstraZeneca
Villegas et al. (Fri,) studied this question.