In data-scarce settings such as Norway, predicting cost performance in large-scale road (LSR) projects is particularly challenging, given the high risk of cost overruns and their significant economic implications. This study develops a data-driven framework for predicting cost performance in LSR projects by combining synthetic data generation with machine learning models. Synthetic samples are generated with a Conditional Tabular Generative Adversarial Network (CTGAN) to enlarge the data pool and improve predictive accuracy. Combining 173 synthetically generated samples with 52 actual project samples yielded a dataset of 225 road projects. Three machine learning classifiers, XGBoost, a multilayer perceptron (MLP), and a support vector machine (SVM), were applied to this enriched dataset. When tested against real-world data, the models achieved an average accuracy of 0.76 and an F1 score of 0.74, indicating substantial agreement with actual project outcomes. Five-fold cross-validation on the combined dataset yielded similar accuracy and F1 scores, confirming the consistency of these results. This research highlights the effectiveness of synthetic data in overcoming the limitations of small datasets and underscores its potential to improve decision-making in highway engineering by providing more accurate, data-driven insights for project planning, design, and management.
Mirhosseini et al. studied this question.
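The evaluation quantities named in the abstract (accuracy, F1 score, and a 5-fold split of the 225 combined samples) can be illustrated with a minimal standard-library sketch. It is not the study's implementation: the function names, the binary labeling (1 = cost overrun, 0 = no overrun), the random seed, and the toy label vectors are all assumptions made for illustration, and no CTGAN model or classifier is trained here.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle indices 0..n_samples-1 and deal them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


# Sample counts taken from the abstract: 52 real + 173 synthetic = 225.
N_REAL, N_SYNTHETIC = 52, 173
folds = k_fold_indices(N_REAL + N_SYNTHETIC, k=5)
fold_sizes = [len(fold) for fold in folds]

# Toy labels (hypothetical, not project data): 1 = overrun, 0 = no overrun.
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 0, 1, 1, 0]
toy_accuracy = accuracy(toy_true, toy_pred)
toy_f1 = f1_score(toy_true, toy_pred)
```

In a full pipeline, each fold would serve once as the held-out set while a classifier is fitted on the remaining four, and the per-fold accuracy and F1 values would then be averaged. A stratified split that preserves class proportions in each fold may be preferable for a dataset this small; the plain shuffle above is used only to keep the sketch short.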