This study trained and compared four machine learning classifiers (Logistic Regression, Random Forest, XGBoost, and Support Vector Machine) to predict Long COVID risk using the Mexico COVID-19 open dataset, which contains 391,979 confirmed positive cases. A stratified sample of 25,000 patients was drawn, and SMOTE was applied to the training set to address class imbalance. Features included age, sex, pneumonia, diabetes, hypertension, obesity, and other comorbidities. XGBoost performed best, with a ROC-AUC of 0.895 and a test accuracy of 85.7%. SHAP explainability analysis identified pneumonia as the most influential predictor, followed by age and sex. The results suggest that basic clinical data available at the time of a positive COVID-19 test can be used to screen patients for Long COVID risk with meaningful accuracy.
Suhaan Thayyil studied this question.
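The resampling order described in the abstract (stratified split first, then oversampling of the training portion, then scoring by ROC-AUC on untouched test data) can be illustrated with a minimal, dependency-free sketch. This is not the study's code: the helper names `stratified_split`, `smote_like`, and `roc_auc` are hypothetical, the data are synthetic stand-ins rather than the Mexico dataset, and the interpolation step is a simplified SMOTE-style procedure that assumes a binary label.

```python
import random


def stratified_split(labels, test_fraction, rng):
    """Return (train_idx, test_idx), keeping the class ratio in both parts."""
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx


def smote_like(X, y, minority, k, rng):
    """Balance a binary problem by interpolating between a minority row
    and one of its k nearest minority neighbours (simplified SMOTE)."""
    minority_rows = [x for x, label in zip(X, y) if label == minority]
    n_needed = (len(y) - len(minority_rows)) - len(minority_rows)
    X_out, y_out = list(X), list(y)
    for _ in range(max(n_needed, 0)):
        base = rng.choice(minority_rows)
        neighbours = sorted(
            (r for r in minority_rows if r is not base),
            key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        X_out.append([a + gap * (b - a) for a, b in zip(base, other)])
        y_out.append(minority)
    return X_out, y_out


def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a positive outscores a negative."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


rng = random.Random(0)
# Synthetic stand-in rows (NOT the study data): [age in decades, pneumonia flag]
X = [[rng.uniform(2, 9), rng.randint(0, 1)] for _ in range(200)]
y = [1 if i < 30 else 0 for i in range(200)]  # 15% positive class

train_idx, test_idx = stratified_split(y, 0.2, rng)
X_train = [X[i] for i in train_idx]
y_train = [y[i] for i in train_idx]

# Oversample the training portion only; the test rows are never resampled.
X_bal, y_bal = smote_like(X_train, y_train, minority=1, k=5, rng=rng)

print(len(test_idx), y_train.count(1), y_bal.count(1), y_bal.count(0))
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Splitting before oversampling keeps synthetic rows out of the test set, so the reported test metrics reflect the original class balance. Note that plain interpolation produces fractional values for binary flags such as pneumonia; the imbalanced-learn library provides `SMOTENC` for mixed categorical and continuous features, though whether the study used that variant is not stated.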