Class imbalance is a persistent challenge in supervised machine learning, particularly in biological datasets where minority classes represent functionally critical categories. Synthetic data generation has emerged as a principal strategy for mitigating this problem, yet systematic comparisons of classical and modern deep generative approaches remain limited. This study presents a benchmark evaluation of four synthetic data generation methods (SMOTE, CTGAN, TVAE, and TabDDPM) on two well-established biological datasets from the UCI Machine Learning Repository: the E. coli protein localization dataset (307 samples, 6 features, 4 classes) and the yeast protein localization dataset (1299 samples, 8 features, 4 classes). Synthetic data quality was assessed with a multi-dimensional evaluation framework covering distributional fidelity (Fréchet Distance, Wasserstein Distance), machine learning utility (Train-on-Synthetic-Test-on-Real and Train-on-Real-Test-on-Real protocols using XGBoost version 3.2.0, Logistic Regression, Support Vector Machines, and Random Forest), and distinguishability (Classifier Two-Sample Test, C2ST). Both datasets are imbalanced. In the experiments, each dataset was enlarged to three times its original size while preserving the imbalanced class-sample ratio. To evaluate the quality of synthetic data, the max(AUC, 1−AUC) score is proposed. Lower values of this score correspond to weaker real-versus-synthetic classification performance, indicating that synthetic data are not easily distinguishable from real data. Per-class analysis reveals that minority classes remain the primary challenge across all generative methods. SMOTE and TabDDPM obtained the highest predictive utility (F1-scores) on both datasets. TVAE offers the strongest distributional fidelity among the deep generative models, producing synthetic samples that are the most difficult to distinguish from real data (lowest C2ST scores). CTGAN exhibits significant performance degradation on both the small- and medium-scale datasets, with F1 utility ratios below 0.50.
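The two summary quantities named in the abstract can be illustrated with a minimal sketch. It makes several assumptions that the abstract does not confirm: the AUC is taken from a real-versus-synthetic classifier's scores (label 1 for real, 0 for synthetic), the F1 score is macro-averaged, and the "F1 utility ratio" is the Train-on-Synthetic-Test-on-Real F1 divided by the Train-on-Real-Test-on-Real F1. All function names are hypothetical and the code uses only the Python standard library.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the Mann-Whitney statistic; tied scores count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]  # assumed: 1 = real
    neg = [s for y, s in zip(labels, scores) if y == 0]  # assumed: 0 = synthetic
    if not pos or not neg:
        raise ValueError("need both real and synthetic samples")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def c2st_score(labels: Sequence[int], scores: Sequence[float]) -> float:
    """max(AUC, 1 - AUC): 0.5 is chance level, 1.0 is fully separable."""
    auc = roc_auc(labels, scores)
    return max(auc, 1.0 - auc)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 (macro averaging is an assumption)."""
    f1_values = []
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1_values.append(2 * tp / denom if denom else 0.0)
    return sum(f1_values) / len(f1_values)


def f1_utility_ratio(
    y_true: Sequence[int],
    pred_tstr: Sequence[int],
    pred_trtr: Sequence[int],
) -> float:
    """Assumed definition: F1 of the synthetic-trained model over the F1 of
    the real-trained model, both evaluated on the same real test labels."""
    baseline = macro_f1(y_true, pred_trtr)
    if baseline == 0.0:
        raise ValueError("real-trained baseline F1 is zero")
    return macro_f1(y_true, pred_tstr) / baseline


if __name__ == "__main__":
    # A classifier that ranks every synthetic sample above every real one has
    # AUC 0, yet it separates the two sets perfectly, so the score is 1.0.
    labels = [1, 1, 1, 0, 0, 0]
    scores = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
    print(c2st_score(labels, scores))
```

Taking the maximum of AUC and 1−AUC treats a classifier that is consistently wrong as just as informative as one that is consistently right, so a value near 0.5 is the only outcome that signals hard-to-distinguish synthetic data.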
Ali Fatih Gündüz
Canan Batur Şahin
Malatya Turgut Özal Üniversitesi
Applied Sciences
Turgut Özal University
Gündüz et al. (Thu,) studied this question.
synapsesocial.com/papers/69db38534fe01fead37c69fb — DOI: https://doi.org/10.3390/app16083694