This study develops and compares interpretable machine learning (ML) models that independently predict the risk of diabetes and depression, using Behavioral Risk Factor Surveillance System (BRFSS) data for individuals aged 18 to 64 collected from 2016 to 2022. The models incorporate behavioral, demographic, and socioeconomic features and treat vaping as a central predictor alongside established risk factors such as smoking, obesity, income, and marital status. We train and evaluate 15 ML algorithms using 10-fold cross-validation and identify Extreme Gradient Boosting (XGBoost) as the best-performing model, achieving 84.8 percent accuracy for diabetes and 71.2 percent for depression. To enhance clinical and policy relevance, we tune the model's hyperparameters and extract SHAP (SHapley Additive exPlanations) values to rank feature importance and interpret each predictor's contribution to the model's predictions. Vaping emerges as one of the top three predictors in both models, rivaling traditional risk factors such as smoking. Specifically, vaping is strongly associated with increased predicted risk for diabetes (SHAP ≈ 1.99) and depression (SHAP ≈ 1.79), challenging assumptions of its relative safety compared to smoking. Socioeconomic vulnerabilities, such as low income and unmarried status, also contribute significantly to elevated risk. Racial and ethnic disparities are evident, with variable contributions to disease prediction across groups. Our findings offer three key contributions: (1) establishing vaping as a significant behavioral risk factor for both metabolic and mental health conditions; (2) introducing scalable, explainable ML models for early detection and risk stratification; and (3) proposing an integrative framework for evaluating emerging health behaviors. These insights support targeted public health interventions and inform regulatory discourse around vaping in chronic disease prevention.
This study recommends that future research and public health planning should continue to integrate ML tools to enhance transparency and precision in identifying emerging behavioral risks, particularly as patterns of substance use and health behavior evolve.

• Develop an interpretable machine learning framework for jointly predicting diabetes and depression risk.
• Identify vaping as a leading behavioral predictor of both metabolic and mental health risk.
• Reveal nonlinear effects of income and marital status on chronic disease risk through socioeconomic analysis.
• Provide transparent, feature-level explanations to support interpretable population health analytics.
• Offer a scalable and policy-relevant analytical tool to inform chronic disease prevention and public health decisions.
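
To make the idea of a feature-level SHAP attribution concrete, the following is a minimal sketch of an exact Shapley value computation on a toy risk score. It is not the study's pipeline: the function `shapley_values`, the scoring function `toy_log_odds`, its coefficients, and the single all-zeros baseline are all hypothetical choices made for illustration, and the SHAP library's tree explainer used with XGBoost is considerably more involved than this brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, x, baseline):
    """Exact Shapley attribution of value_fn(x) - value_fn(baseline).

    Features outside a coalition are set to their baseline value.
    Enumerates every coalition, so it is only practical for a few features.
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                present = set(subset)
                without_i = [x[j] if j in present else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                # Marginal contribution of feature i to this coalition.
                phi[i] += weight * (value_fn(with_i) - value_fn(without_i))
    return phi


def toy_log_odds(features):
    """Hypothetical risk score in log-odds; coefficients are made up,
    not estimates from the study."""
    vaping, smoking, low_income = features
    return (-2.0 + 1.0 * vaping + 0.8 * smoking + 0.5 * low_income
            + 0.4 * vaping * smoking)  # interaction term makes the split nontrivial


feature_names = ["vaping", "smoking", "low_income"]
x = [1, 1, 0]          # hypothetical respondent: vapes, smokes, not low income
baseline = [0, 0, 0]   # hypothetical reference respondent

phi = shapley_values(toy_log_odds, x, baseline)
for name, value in sorted(zip(feature_names, phi), key=lambda p: -abs(p[1])):
    print(f"{name}: {value:.3f}")
```

The attributions sum to the difference between the respondent's score and the baseline score, which is the additivity property that lets per-feature values be read as contributions to a single prediction. Ranking features by the magnitude of such values, aggregated over many respondents, is one common way to obtain an importance ordering of the kind reported above.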
Chike et al. (Sun,) studied this question.