What question did this study set out to answer?

The study aims to evaluate and compare the effectiveness of different synthetic data generation methods to address class imbalance in protein localization datasets.

April 12, 2026

Open Access

Synthetic Data Augmentation for Imbalanced Tabular Protein Subcellular Localization: A Comparative Study of SMOTE, CTGAN, TVAE, and TabDDPM Methods

Key Points

Benchmark evaluation of SMOTE, CTGAN, TVAE, and TabDDPM methods.
Utilization of two biological datasets from the UCI Machine Learning Repository.
Assessment of synthetic data quality using multi-dimensional metrics.
Implementation of machine learning models including XGBoost, Logistic Regression, and Random Forest.
SMOTE and TabDDPM showed the highest predictive utility F1-scores.
TVAE achieved the best distributional fidelity among the deep generative models, with the lowest distinguishability (C2ST) scores.
CTGAN showed significant performance degradation on both datasets, with F1 utility ratios below 0.50.
The datasets were augmented to three times their original size while preserving the original class imbalance ratio.
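
The F1 utility ratio mentioned in the key points can be illustrated with a small sketch. This is not the authors' code. It assumes the ratio is the F1-score of a model trained on synthetic data and tested on real data (TSTR) divided by the F1-score of the same model trained and tested on real data (TRTR), and it assumes macro-averaged F1. The function names, class labels, and predictions are hypothetical toy values.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for c in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never hit contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def utility_ratio(y_true, pred_tstr, pred_trtr):
    """Assumed definition: macro F1 under TSTR divided by macro F1 under TRTR."""
    baseline = macro_f1(y_true, pred_trtr)
    if baseline == 0:
        raise ValueError("TRTR baseline F1 is zero; ratio is undefined")
    return macro_f1(y_true, pred_tstr) / baseline


# Toy real test labels (hypothetical, imbalanced toward class "a").
y_real = ["a", "a", "a", "a", "b", "b", "c", "c"]
# Hypothetical predictions from a model trained on real data (TRTR).
pred_real_trained = ["a", "a", "a", "a", "b", "b", "c", "c"]
# Hypothetical predictions from a model trained on synthetic data (TSTR);
# the errors fall on the minority classes "b" and "c".
pred_synth_trained = ["a", "a", "a", "a", "a", "b", "a", "c"]

ratio = utility_ratio(y_real, pred_synth_trained, pred_real_trained)
print(round(ratio, 3))
```

Under this assumed definition, a ratio near 1 means the synthetic data train a model about as well as the real data do, while a ratio below 0.50 means more than half of the baseline F1 is lost.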

Abstract

Class imbalance is a persistent challenge in supervised machine learning, particularly in biological datasets where minority classes represent functionally critical categories. Synthetic data generation has emerged as a principal strategy for mitigating this problem, yet systematic comparisons of classical and modern deep generative approaches remain limited. This study presents a comprehensive benchmark evaluation of four synthetic data generation methods (SMOTE, CTGAN, TVAE, and TabDDPM) across two well-established biological datasets from the UCI Machine Learning Repository: the E. coli protein localization dataset (307 samples, 6 features, 4 classes) and the yeast protein localization dataset (1299 samples, 8 features, 4 classes). Synthetic data quality was assessed using a multi-dimensional evaluation framework encompassing distributional fidelity (Fréchet Distance, Wasserstein Distance), machine learning utility (Train-on-Synthetic-Test-on-Real and Train-on-Real-Test-on-Real protocols using XGBoost version 3.2.0, Logistic Regression, Support Vector Machines, and Random Forest), and distinguishability (Classifier Two-Sample Test, C2ST). Both datasets are imbalanced; during the experiments, the dataset size was increased to three times its original size while preserving the imbalanced class-sample ratio. To evaluate distinguishability, the score max(AUC, 1−AUC) is proposed, where AUC is that of a classifier trained to separate real from synthetic samples. The score falls toward 0.5 as that classifier performs worse, so a lower score indicates that synthetic data are not easily distinguishable from real data. Per-class analysis reveals that minority classes remain the primary challenge across all generative methods. SMOTE and TabDDPM achieve the highest predictive utility F1-scores across both datasets. TVAE offers the strongest distributional fidelity among deep generative models, producing synthetic samples that are most difficult to distinguish from real data (lowest C2ST scores). CTGAN exhibits significant performance degradation on both small- and medium-scale datasets, with F1 utility ratios below 0.50.
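
The max(AUC, 1−AUC) distinguishability score from the abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: it assumes a binary classifier has already been trained to separate real rows (label 1) from synthetic rows (label 0) and has produced one score per held-out row. The AUC is computed with the standard rank-based (Mann-Whitney) formula, and the helper names and toy scores are hypothetical.

```python
def auc(labels, scores):
    """Probability that a random real row outscores a random synthetic row.

    Ties count as half a win (Mann-Whitney formulation of ROC AUC).
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one real and one synthetic row")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def c2st_score(labels, scores):
    """max(AUC, 1 - AUC): 0.5 means indistinguishable, 1.0 fully separable."""
    a = auc(labels, scores)
    return max(a, 1.0 - a)


# 1 = real row, 0 = synthetic row (toy held-out set).
labels = [1, 1, 1, 0, 0, 0]

separable = c2st_score(labels, [0.9, 0.8, 0.7, 0.3, 0.2, 0.1])
inverted = c2st_score(labels, [0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
uninformative = c2st_score(labels, [0.5, 0.5, 0.5, 0.5, 0.5, 0.5])

print(separable, inverted, uninformative)
```

Taking the maximum with 1−AUC treats a classifier that separates the two sets with flipped labels as just as informative as one that separates them correctly, so the score always lies between 0.5 and 1.0.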

Authors

Ali Fatih Gündüz

Canan Batur Şahin

Malatya Turgut Özal Üniversitesi

Journals

Applied Sciences

Institutions

Turgut Özal University

Malatya Turgut Özal Üniversitesi
