What question did this study set out to answer?

The research aims to evaluate how well different machine learning classifiers can predict Long COVID risk.

March 18, 2026 | Open Access

Predicting Long COVID Risk with Machine Learning: A Comparison of Four Classifiers and SHAP Explainability on a Large Population Dataset

Key Points

Trained four classifiers: Logistic Regression, Random Forest, XGBoost, and Support Vector Machine.
Used a stratified sample of 25,000 patients from the Mexico COVID-19 open dataset (391,979 confirmed positive cases).
Applied SMOTE to address class imbalance in the training set.
Evaluated classifiers on ROC-AUC and test accuracy.
XGBoost had the highest performance with a ROC-AUC of 0.895.
Test accuracy for the best model was 85.7%.
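SHAP analysis identified pneumonia as the most influential predictor of Long COVID risk, followed by age and sex.
Results suggest that basic clinical data available at the time of a positive COVID-19 test can screen for Long COVID risk with meaningful accuracy.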
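
The order of the sampling steps in the key points matters: the data is split with stratification first, and oversampling is applied to the training portion only, so the test set keeps the original class balance. The sketch below illustrates that order with a from-scratch, standard-library version of SMOTE-style interpolation. It is not the study's code. The toy features, the 10-to-40 class ratio, the helper names `stratified_split` and `smote_oversample`, and the choice of `k=3` are all assumptions made for illustration; a real pipeline would more likely use a maintained implementation such as `SMOTE` from imbalanced-learn.

```python
import random

def stratified_split(y, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), keeping the class ratio in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(set(y)):
        idx = [i for i, value in enumerate(y) if value == label]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx

def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a sampled
    minority row and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority rows by squared Euclidean distance to base.
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical toy data: two numeric features, 10 positives and 40 negatives.
data_rng = random.Random(42)
X = [[data_rng.uniform(20, 90), data_rng.uniform(0, 1)] for _ in range(50)]
y = [1] * 10 + [0] * 40

# 1. Split first, so no synthetic row can leak into the test set.
train_idx, test_idx = stratified_split(y)
X_train = [X[i] for i in train_idx]
y_train = [y[i] for i in train_idx]

# 2. Oversample the minority class in the training set only.
minority = [row for row, label in zip(X_train, y_train) if label == 1]
n_new = (len(y_train) - len(minority)) - len(minority)
synthetic = smote_oversample(minority, n_new)
X_train_bal = X_train + synthetic
y_train_bal = y_train + [1] * len(synthetic)

print(sum(y_train_bal), len(y_train_bal) - sum(y_train_bal), len(test_idx))
# prints: 32 32 10
```

Plain interpolation produces fractional values, which is a poor fit for binary indicators such as pneumonia or diabetes. How the study handled this is not stated in the text.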

Abstract

This study trained and compared four machine learning classifiers (Logistic Regression, Random Forest, XGBoost, and Support Vector Machine) to predict Long COVID risk using the Mexico COVID-19 open dataset (391,979 confirmed positive cases). A stratified sample of 25,000 patients was drawn, and SMOTE was applied to the training set to address class imbalance. Features included age, sex, pneumonia, diabetes, hypertension, obesity, and other comorbidities. XGBoost achieved the best performance, with a ROC-AUC of 0.895 and 85.7% test accuracy. SHAP explainability analysis identified pneumonia as the most influential predictor, followed by age and sex. The results suggest that basic clinical data available at the time of a positive COVID-19 test can be used to screen patients for Long COVID risk with meaningful accuracy.
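
The two reported metrics measure different things. Accuracy depends on a decision threshold and can look strong on imbalanced data, while ROC-AUC is threshold-free: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The following minimal sketch computes both from scratch. The labels and scores are made-up illustration values, not the study's outputs, and the 0.5 threshold is an assumption.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked
    correctly, counting ties as half a correct pair."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

def accuracy(y_true, scores, threshold=0.5):
    """Share of cases classified correctly at the given score threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == t for p, t in zip(preds, y_true)) / len(y_true)

# Hypothetical labels and predicted probabilities for six patients.
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90]

auc = roc_auc(y_true, scores)   # 8 of 9 pairs ranked correctly
acc = accuracy(y_true, scores)  # 5 of 6 cases correct at threshold 0.5
print(round(auc, 3), round(acc, 3))
# prints: 0.889 0.833
```

The pairwise loop is quadratic in the number of cases, which is fine for a toy example; on a test set drawn from 25,000 patients a rank-based implementation from a standard library would be the practical choice.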
