This study presents a comparative analysis of machine learning models for early diabetes risk prediction using the Pima Indians Diabetes Dataset from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK). The dataset includes key health indicators such as plasma glucose concentration, body mass index (BMI), blood pressure, insulin levels, age, diabetes pedigree function, and related physiological variables. A structured machine learning pipeline was developed using six supervised classification algorithms: Logistic Regression, Decision Tree, Random Forest, Support Vector Machine (SVM), K-Nearest Neighbors, and Naïve Bayes. During preprocessing, biologically implausible zero values were treated as missing data and replaced by median imputation. Features were then standardized so that all variables share a common scale, which stabilizes the performance of scale-sensitive models. Models were evaluated using Accuracy, Precision, Recall, F1-score, and ROC-AUC to give a balanced assessment. Cross-validation results show that all models achieve satisfactory predictive performance, with ensemble-based methods, particularly Random Forest, demonstrating the highest and most consistent classification ability. Logistic Regression and SVM also perform competitively, suggesting that the dataset contains both linear and nonlinear relationships. Feature importance analysis identifies glucose level and BMI as the most significant predictors of diabetes risk, followed by genetic and demographic factors. Statistical testing confirms significant differences among models, with Random Forest, SVM, and Logistic Regression forming a statistically comparable top-performing group. Overall, the findings demonstrate that machine learning methods can effectively support the early detection of diabetes using routine health data.
These models offer strong potential for integration into clinical decision-support systems to enhance early diagnosis, risk stratification, and preventive healthcare strategies.
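The preprocessing and evaluation steps summarized above can be illustrated with a minimal sketch. It is not the study's implementation: the function names (`impute_zeros_with_median`, `standardize`, `classification_metrics`) and the sample values are hypothetical, and only the Python standard library is used. The sketch assumes zero-as-missing handling is applied only to columns where a zero is biologically implausible (for example glucose or BMI), and it omits ROC-AUC, which requires predicted scores rather than hard labels.

```python
from statistics import mean, median, pstdev


def impute_zeros_with_median(column):
    """Treat zeros as missing and replace them with the median of the non-zero values."""
    observed = [v for v in column if v != 0]
    fill = median(observed)
    return [fill if v == 0 else v for v in column]


def standardize(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]


def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical glucose readings; the 0 stands in for a missing measurement.
glucose = [148, 85, 0, 89, 137]
glucose_imputed = impute_zeros_with_median(glucose)
glucose_scaled = standardize(glucose_imputed)

# Hypothetical true and predicted labels for six patients.
metrics = classification_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
```

In a cross-validated setting, the median and the scaling statistics would normally be estimated on the training folds only and then applied to the held-out fold, so that no information leaks from the evaluation data into preprocessing.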
George Hezron Fabian studied this question.