What type of study is this?

This is a Quantitative Study (also classified as: Cohort Study).

September 19, 2025 | Open Access

Empowering Preventive Healthcare: Machine Learning-Based Diabetes Risk Screening Using Survey Data

Key Points

The Voting Classifier with Recursive Feature Elimination achieved the highest recall (0.62), which the authors report as strong sensitivity for diabetes risk detection.
The analysis applied six machine learning models to the 2015 Behavioral Risk Factor Surveillance System (BRFSS) dataset.
Feature selection techniques such as Recursive Feature Elimination were applied to improve model performance and simplicity.
The framework supports preventive care in underserved areas, aligning with Sustainable Development Goals for health equity.
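
To make the recall figure above concrete, the following is a minimal sketch of how a voting ensemble's predictions could be combined and scored. It assumes hard (majority) voting, which the source does not confirm, and the labels and the three model outputs are toy values invented for illustration, not BRFSS results. The function names majority_vote and recall are hypothetical.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Hard voting: for each sample, keep the label most models predicted."""
    n_samples = len(predictions_per_model[0])
    voted = []
    for i in range(n_samples):
        labels = [preds[i] for preds in predictions_per_model]
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


def recall(y_true, y_pred, positive=1):
    """Recall (sensitivity) = true positives / (true positives + false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy data (assumption): 1 = at risk, 0 = not at risk.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
model_a = [1, 1, 0, 0, 0, 0, 1, 0]
model_b = [1, 0, 1, 0, 0, 0, 0, 0]
model_c = [1, 1, 1, 0, 0, 1, 0, 0]

# An odd number of models avoids ties in the binary case.
voted = majority_vote([model_a, model_b, model_c])
ensemble_recall = recall(y_true, voted)
```

In this toy case the ensemble recovers three of the four at-risk samples. A recall of 0.62, as reported in the study, would mean that roughly 62 of every 100 truly at-risk respondents are flagged.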

Abstract

Diabetes is a significant public health issue, especially among low- and middle-income groups, where clinical diagnostic services are scarce or unavailable. This work develops a non-invasive, affordable, and scalable machine learning (ML) framework for early diabetes screening from binary health survey data. The proposed method addresses healthcare inequities by enabling community-level screening without reliance on laboratory tests. Six machine learning classification models, namely Random Forest, Logistic Regression, Decision Tree, Gradient Boosting, AdaBoost, and a Voting Classifier, were implemented on the 2015 Behavioral Risk Factor Surveillance System (BRFSS) dataset, which contains over 300,000 anonymized records. Recursive Feature Elimination (RFE) and correlation-based feature selection were used to optimize the performance and simplicity of the models. The data were preprocessed with label encoding, Z-score normalization, and SMOTE-based class balancing. The models were trained and tested using stratified 5-fold cross-validation and evaluated on accuracy, recall, F1-score, and ROC-AUC. Of all the models, the Voting Classifier with RFE achieved the highest recall (0.62), showing strong sensitivity in detecting high-risk individuals. This supports the use of survey-only data for efficiently identifying people at risk of developing diabetes in non-clinical settings. The research provides a socially significant and reproducible AI framework for delivering preventive care equitably, especially in underserved contexts. It aligns with the Sustainable Development Goals (SDG 3: Good Health and Well-being, and SDG 10: Reduced Inequalities) and offers practical takeaways for policymakers, public health practitioners, and NGOs seeking scalable digital health applications.
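
Two of the preprocessing steps named in the abstract, Z-score normalization and SMOTE-based class balancing, can be illustrated with a short standard-library sketch. This is not the authors' implementation: the rows are invented toy values, the helper names z_score_columns and smote_like are hypothetical, and the oversampler is a simplified stand-in for SMOTE that interpolates between a minority sample and one of its k nearest minority neighbours.

```python
import random
import statistics


def z_score_columns(rows):
    """Standardize each column to mean 0 and population standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority points by linear interpolation
    between a random minority sample and one of its k nearest neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy survey rows (assumption): two yes/no answers and one numeric value.
X = [[0, 1, 25.0], [1, 0, 31.5], [1, 1, 40.2], [0, 0, 22.1],
     [0, 1, 28.4], [1, 1, 35.0], [0, 0, 24.3], [1, 1, 38.7]]
y = [0, 0, 1, 0, 0, 1, 0, 1]

X_scaled = z_score_columns(X)
minority = [row for row, label in zip(X_scaled, y) if label == 1]
n_majority = sum(1 for label in y if label == 0)

# Add just enough synthetic minority rows to match the majority count.
synthetic = smote_like(minority, n_new=n_majority - len(minority))
X_balanced = X_scaled + synthetic
y_balanced = y + [1] * len(synthetic)
```

Interpolation yields fractional values even for yes/no columns, so a real pipeline on binary survey data would need to decide how to treat them. When oversampling is combined with cross-validation, it is generally applied to the training folds only, so that synthetic points do not leak into the evaluation folds.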
