What question did this study set out to answer?

The aim is to evaluate the impact of dataset quality on DNN performance in software engineering tasks.

February 14, 2026 · Open Access

Data preparation and quality for code-centric generative software engineering tasks: a systematic literature review

Key Points

Aimed to evaluate the impact of dataset quality on DNN performance in software engineering tasks
Conducted a systematic literature review of 70 primary studies
Analyzed dataset construction methodologies and quality challenges
Evaluated proposed solutions for improving dataset quality
Identified significant issues like noise, redundancy, and imbalance in datasets
Proposed strategies such as data augmentation and automated cleaning to enhance quality
Highlighted the importance of dataset diversity and timeliness for better model generalization

Abstract

The rapid advancements in Deep Neural Networks (DNNs) have revolutionized generative software engineering tasks, including code summarization, program repair, code generation, and code translation. However, the performance of DNN models in these tasks heavily depends on the quality of their training and evaluation datasets. This systematic literature review examines 70 primary studies to comprehensively analyze dataset construction methodologies, prevalent data quality challenges, and solutions proposed to address these challenges. Our findings reveal that dataset construction processes significantly influence quality, with common issues such as noise, redundancy, imbalance, and insufficient granularity undermining model effectiveness. We identify key strategies to mitigate these problems, including data augmentation, automated cleaning techniques, and standardized validation frameworks. Furthermore, we highlight the critical role of dataset diversity and timeliness in improving model generalization. This study provides actionable insights for researchers and practitioners in the era of generative AI, where high-quality datasets are essential for developing reliable language models as software engineering tools. By emphasizing rigorous dataset curation and innovative quality assurance methods, our work bridges the gap between theoretical advancements and practical applications, enabling the creation of robust, generalizable models for real-world code-related tasks. The synthesized recommendations aim to guide future research in optimizing dataset design, fostering reproducibility, and addressing evolving challenges in data-driven software engineering.

