What question did this study set out to answer?

The aim is to characterize and evaluate the experimental design and reproducibility of software defect prediction studies.

February 22, 2026 | Open Access

An audit of machine learning experiments on software defect prediction

Key points

Conducted an audit of experiments indexed in the SCOPUS database (2019-2023).
Assessed experimental designs and statistical practices based on established norms.
Randomly sampled 101 studies from approximately 1,585 identified experiments.
Evaluated various issues related to experimental design and reproducibility.
Detected 427 issues across 101 papers (median of 4 per paper), with only one paper entirely issue-free.
Found considerable variation in the number of datasets (1-365) and performance metrics (1-9) used.
Almost half of the studies provided insufficient detail for reproduction.
Approximately 45% of papers used formal statistical inference.

Abstract

Machine learning algorithms are increasingly being proposed to solve the problem of predicting defect-prone software components. In this literature, computational experiments are the primary means of evaluating and comparing learners, and the credibility of findings depends critically on their experimental design and reporting. This paper audits recent software defect prediction (SDP) experiments by assessing their experimental design, analysis and reporting practices against widely accepted norms from statistics, machine learning and empirical software engineering. Our aim is to characterise the current state of practice and evaluate the reproducibility of published findings. We undertook an audit of relevant studies published from the SCOPUS database (2019-2023), focusing on their experimental design and analysis choices, e.g., the outcome variables such as F-measure and the type of out-of-sample (OOS) validation regime, e.g., cross-validation, plus the statistical analysis and inference mechanisms. In all, we evaluated nine different study issues. This was complemented by an assessment of reproducibility using the instrument proposed by González-Barahona and Robles. Our search located approximately 1,585 experiments in SDP (2019-2023), a substantial body of work. From this, we randomly sampled 101 (6.4%) papers, 61 journal and 40 conference papers. Almost 50% are behind ‘paywalls’. We found considerable divergence in research practice. The number of datasets used ranged from 1 to 365, the number of learners or learner variants evaluated from 1 to 34 and the number of performance metrics from 1 to 9. Approximately 45% of papers made use of formal statistical inference. We detected a total of 427 issues distributed across 101 papers (median=4) with only one paper being entirely issue-free. In terms of reproducibility, experiments ranged from near perfect to lacking almost all required information.
We also found two examples of tortured phrases and potential “paper mill” activity. Approaches to designing and reporting computational experiments varied greatly, but almost half the studies provided insufficient information such that reproduction would be challenging. Overall, our audit suggests that as a research community, we have considerable scope for improvement. Fortunately, many improvements should be neither difficult nor costly to achieve.
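As an illustration of the sampling step described in the abstract, the sketch below draws 101 items at random, without replacement, from a pool of 1,585 and reports the sampling fraction. It is a minimal sketch only: the abstract does not say how the authors implemented their sampling, and the integer study IDs, the `sample_studies` helper and the seed value are all hypothetical.

```python
import random


def sample_studies(study_ids, k, seed):
    """Draw k studies without replacement; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    return sorted(rng.sample(study_ids, k))


# Hypothetical placeholder IDs standing in for the ~1,585 located experiments.
population = list(range(1, 1586))

# The seed is arbitrary and chosen only for illustration.
sample = sample_studies(population, k=101, seed=2023)

fraction = len(sample) / len(population)
print(f"{len(sample)} of {len(population)} sampled ({fraction:.1%})")
# prints "101 of 1585 sampled (6.4%)"
```

Recording the seed alongside the list of candidate studies is one simple way to let others regenerate exactly the same sample.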

Cite This Study

Destefanis et al. studied this question.

synapsesocial.com/papers/699a9d14482488d673cd2c2a
https://doi.org/10.1007/s10664-025-10797-w
