March 3, 2026

Predicting student dropout risk in higher education: A machine learning approach

Key Points

The best dropout risk models reached AUC values of 0.94, indicating excellent discrimination between at-risk and non-at-risk students.
The Random Forest model, along with its Balanced variant, demonstrated recall rates exceeding 0.8 for at-risk students.
An exploratory data analysis was carried out first, and the SMOTE technique was then applied to handle class imbalance in the dataset.
SHAP values were utilized to identify and analyze the most significant risk factors affecting student dropout.

Abstract

Student dropout has become a major challenge for higher education institutions: high dropout rates have negative academic and social consequences and cause economic losses for all involved. This study therefore proposes a machine learning-based approach for the early prediction of dropout risk from academic, socioeconomic, and demographic variables. The methodology started with an exploratory data analysis, followed by class balancing of the dataset using the SMOTE technique. Supervised classification algorithms, namely Support Vector Machines (SVM), XGBoost, Random Forest, and Balanced Random Forest Classifier, were then trained to build the prediction models, with GridSearchCV used to search for the best hyperparameters. The models were evaluated using metrics suited to class imbalance: recall, ROC-AUC, geometric mean (G-Mean), precision, and F1-Score. The best-performing models were Random Forest and its Balanced Random Forest Classifier variant, both without hyperparameter optimization, with AUC values of 0.94 and recall values greater than 0.8 in the positive class (at-risk students). In addition, SHAP values were used to explain the models and to identify and analyze the most influential risk factors (academic and socioeconomic characteristics).
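
To make the threshold-based evaluation metrics named in the abstract concrete, the following is a minimal sketch of how recall, precision, F1-Score, and G-Mean can be computed from binary predictions. It is an illustration, not the study's code: the function name `imbalance_metrics`, the label convention (1 = at-risk), and the toy labels are assumptions, and G-Mean is taken in its common form as the square root of recall times specificity.

```python
import math


def imbalance_metrics(y_true, y_pred, positive=1):
    """Recall, specificity, precision, F1 and G-Mean for a binary task.

    Hypothetical helper for illustration; `positive` marks the at-risk class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    def ratio(num, den):
        # Return 0.0 instead of raising when a class is absent.
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)        # share of at-risk students detected
    specificity = ratio(tn, tn + fp)   # share of other students correctly cleared
    precision = ratio(tp, tp + fp)     # share of alerts that were correct
    f1 = ratio(2 * precision * recall, precision + recall)
    g_mean = math.sqrt(recall * specificity)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "g_mean": g_mean,
    }


# Toy labels (made up for this sketch): 4 at-risk students, 6 others.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = imbalance_metrics(y_true, y_pred)
```

Because G-Mean multiplies recall by specificity, a model that labels every student as not at risk scores 0 on it even when its accuracy looks high, which is why it is useful alongside recall on imbalanced data.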
