Diabetes is a significant public health issue, especially among low- and middle-income groups, where clinical diagnostic services are scarce or unavailable. This work develops a non-invasive, affordable, and scalable machine learning (ML) framework for the early screening of diabetes from binary health survey data. The proposed method helps reduce healthcare inequities, since community-level screening can be carried out without relying on laboratory-based tests. Six ML classification models, namely Random Forest, Logistic Regression, Decision Tree, Gradient Boosting, AdaBoost, and a Voting Classifier, were implemented on the 2015 Behavioral Risk Factor Surveillance System (BRFSS) dataset, which contains over 300,000 anonymized records. Recursive Feature Elimination (RFE) and correlation-based feature selection were used to improve model performance and simplicity. The data were preprocessed with label encoding, Z-score normalization, and SMOTE-based class balancing. The models were trained and tested using stratified 5-fold cross-validation and evaluated on accuracy, recall, F1-score, and ROC-AUC. Of all models, the Voting Classifier with RFE achieved the highest recall (0.62), indicating the best sensitivity among the evaluated models for detecting high-risk individuals. This supports the use of survey-only data for efficiently identifying persons at risk of developing diabetes in non-clinical settings. The research provides a socially significant and reproducible AI framework for facilitating equitable preventive care, especially in underserved contexts. It is aligned with the Sustainable Development Goals (SDG 3: Good Health and Well-being, and SDG 10: Reduced Inequalities) and offers practical takeaways for policymakers, public health practitioners, and NGOs seeking scalable digital health applications.
Sulaiman et al. (Thu,) studied this question.
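The evaluation steps named above (SMOTE-style class balancing, stratified 5-fold splitting, and recall) can be illustrated with a minimal standard-library sketch. It is not the authors' implementation: the function names `stratified_folds`, `smote_like`, and `recall` are hypothetical, the oversampling is a simplified version of the SMOTE idea, and a real pipeline would more likely rely on established libraries.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds while preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a similar share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def smote_like(minority, n_new, k=3, seed=0):
    """Simplified SMOTE idea: interpolate between a minority row and one of
    its k nearest minority neighbours (squared Euclidean distance).
    Assumes at least two minority rows. Note that interpolated values are
    no longer strictly binary."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def recall(y_true, y_pred, positive=1):
    """Recall (sensitivity) = TP / (TP + FN) for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0
```

In a setup like the one described, oversampling would normally be applied only to the training folds, so that each held-out fold keeps the original class distribution and the reported recall is not inflated by synthetic samples.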