What question did this study set out to answer?

This research aims to compare machine learning models for predicting diabetes risk using clinical health indicators.

May 6, 2026 | Open Access

Comparative Analysis of Machine Learning Models for Diabetes Risk Prediction Using Clinical Health Indicators

Key points

Utilized the Pima Indians Diabetes Dataset for model evaluation
Applied six classification algorithms: Logistic Regression, Decision Tree, Random Forest, Support Vector Machine, K-Nearest Neighbors, and Naïve Bayes
Preprocessed the data by treating biologically implausible zero values as missing, imputing them with the median, and standardizing the features
Evaluated models using metrics including Accuracy, Precision, Recall, F1-score, and ROC-AUC
Random Forest showed the highest and most consistent classification performance under cross-validation
Logistic Regression and Support Vector Machine performed competitively
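Glucose level and body mass index were identified as the most significant predictors of diabetes risk, followed by genetic and demographic factors
Statistical testing found significant differences among models, with Random Forest, SVM, and Logistic Regression forming a statistically comparable top-performing group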
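
The preprocessing steps listed above can be illustrated with a minimal sketch. It is not the authors' code: it uses only the Python standard library, the function names `impute_zeros_with_median` and `standardize` are hypothetical, and the glucose readings are made-up toy values. The summary does not say which columns were treated as containing missing zeros, so applying the step to glucose here is an assumption.

```python
from statistics import fmean, median, pstdev


def impute_zeros_with_median(column):
    """Replace zeros (treated as missing) with the median of the non-zero values."""
    observed = [v for v in column if v != 0]
    if not observed:
        # Nothing to estimate a median from; return the column unchanged.
        return list(column)
    fill = median(observed)
    return [fill if v == 0 else v for v in column]


def standardize(column):
    """Rescale a column to zero mean and unit variance (z-scores)."""
    mean = fmean(column)
    std = pstdev(column)
    if std == 0:
        # A constant column carries no spread; map every value to 0.0.
        return [0.0 for _ in column]
    return [(v - mean) / std for v in column]


# Toy glucose readings (illustrative only). A reading of 0 is biologically
# implausible, so it is treated as a missing value.
glucose = [148, 85, 0, 89, 137, 0, 116]

imputed = impute_zeros_with_median(glucose)
scaled = standardize(imputed)
```

In a cross-validated setting, the median, mean, and standard deviation would normally be estimated on the training folds only and then applied to the held-out fold, so that no information leaks from the evaluation data.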

Abstract

This study presents a comparative analysis of machine learning models for early diabetes risk prediction using the Pima Indians Diabetes Dataset from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK). The dataset includes key health indicators such as plasma glucose concentration, body mass index (BMI), blood pressure, insulin levels, age, diabetes pedigree function, and related physiological variables. A structured machine learning pipeline was developed using six supervised classification algorithms: Logistic Regression, Decision Tree, Random Forest, Support Vector Machine, K-Nearest Neighbors, and Naïve Bayes. During preprocessing, biologically implausible zero values were treated as missing data and handled using median imputation. Feature standardization was applied to ensure uniform scaling and improve model performance stability. Model evaluation was conducted using Accuracy, Precision, Recall, F1-score, and ROC-AUC to ensure a comprehensive and balanced assessment. Cross-validation results show that all models achieve satisfactory predictive performance, with ensemble-based methods, particularly Random Forest, demonstrating the highest and most consistent classification ability. Logistic Regression and Support Vector Machine also perform competitively, indicating the presence of both linear and nonlinear relationships in the dataset. Feature importance analysis identifies glucose level and body mass index as the most significant predictors of diabetes risk, followed by genetic and demographic factors. Statistical testing confirms significant differences among models, with Random Forest, SVM, and Logistic Regression forming a statistically comparable top-performing group. Overall, the findings demonstrate that machine learning methods can effectively support the early detection of diabetes using routine health data. 
These models offer strong potential for integration into clinical decision-support systems to enhance early diagnosis, risk stratification, and preventive healthcare strategies.
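
To make the five evaluation metrics named in the abstract concrete, the following sketch computes them for a binary classifier using only the Python standard library. It is an illustration under stated assumptions, not the study's evaluation code: the helper names `classification_metrics` and `roc_auc` are hypothetical, the labels and scores are invented toy values, and the 0.5 decision threshold is an assumption.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = diabetic)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def roc_auc(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs in which the
    positive case receives the higher score; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities (illustrative only).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7, 0.8, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Accuracy, precision, recall, and F1 depend on the chosen threshold, whereas ROC-AUC is computed from the ranking of the scores alone, which is one reason the two kinds of metric are usually reported together.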
