Student dropout has become a major challenge for higher education institutions: high dropout rates carry academic and social consequences as well as economic losses for everyone involved. This study therefore proposes a machine learning-based approach for the early prediction of dropout risk from academic, socioeconomic, and demographic variables. The methodology began with an exploratory data analysis, followed by class balancing of the dataset with the SMOTE technique. Supervised classification algorithms such as Support Vector Machines (SVM), XGBoost, Random Forest, and Balanced Random Forest Classifier were then trained to build the prediction models, with GridSearchCV used to search for the best hyperparameters. The models were evaluated with metrics suited to class imbalance, such as recall, ROC-AUC, geometric mean (G-Mean), precision, and F1-score. The best-performing models were Random Forest and its Balanced Random Forest Classifier variant, both without hyperparameter optimization, with AUC values of 0.94 and recall values greater than 0.8 on the positive class (at-risk students). In addition, SHAP values were used to explain the models and to identify and analyze the most influential risk factors among the academic and socioeconomic characteristics.
Peralta et al. studied this question.
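The abstract names G-Mean among its imbalance-aware metrics without defining it. The sketch below assumes the common binary definition, the square root of recall (sensitivity) times specificity, and computes it alongside recall, precision, F1, and plain accuracy from confusion-matrix counts. The function name `imbalance_metrics` and all counts are hypothetical and chosen only for illustration; they are not results from the study.

```python
from math import sqrt


def imbalance_metrics(tp, fp, tn, fn):
    """Return common binary metrics from confusion-matrix counts.

    The positive class is taken to be at-risk students. G-Mean is
    assumed here to be sqrt(recall * specificity).
    """
    total = tp + fp + tn + fn
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "g_mean": sqrt(recall * specificity),
    }


# Hypothetical cohort: 100 at-risk students, 900 not at risk.
# A model that flags most at-risk students at the cost of false alarms.
model = imbalance_metrics(tp=80, fp=90, tn=810, fn=20)

# A trivial baseline that predicts "not at risk" for everyone.
baseline = imbalance_metrics(tp=0, fp=0, tn=900, fn=100)

for name, scores in (("model", model), ("baseline", baseline)):
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

With these made-up counts the baseline reaches a slightly higher accuracy than the model while its recall and G-Mean are both zero, which illustrates why accuracy alone can be misleading when the at-risk class is a small minority.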