May 31, 2026 · Open Access

Which Leakage Types Matter? A Quantitative Landscape Across 2,047 Benchmark Datasets

Key points

The study quantifies how much each type of data leakage inflates measured machine learning performance across a large collection of benchmark datasets.
It runs twenty-eight within-subject counterfactual experiments on 2,047 iid tabular datasets, plus a boundary experiment on 129 temporal datasets.
Severity is measured for four classes of data leakage, with effects reported as changes in AUC (ΔAUC) and paired effect sizes (d_z).
The effect of model capacity on memorization leakage is compared across algorithms.
Class II (selection) leakage is substantial: the measured effect is consistent with about 90% noise exploitation inflating reported scores.
Class III (memorization) leakage scales with model capacity, from d_z = 0.37 (Naive Bayes) to 1.11 (Decision Tree) at 10% duplication.
Class I (estimation) leakage is negligible (|ΔAUC| ≤ 0.005 in all nine conditions), and Class IV (boundary) leakage is invisible under random cross-validation.
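
The d_z values quoted here are paired (within-subject) effect sizes. As a minimal sketch, assuming the conventional definition of Cohen's d_z (mean of the per-dataset score differences divided by their sample standard deviation), the calculation could look like the following. The function name and the AUC values are hypothetical and are not taken from the study.

```python
from statistics import mean, stdev


def paired_effect_size(clean_scores, leaky_scores):
    """Cohen's d_z: mean of paired differences over their sample SD.

    Assumes the conventional within-subject definition; the study's
    exact computation is not given in this summary.
    """
    if len(clean_scores) != len(leaky_scores):
        raise ValueError("score lists must be paired (same length)")
    if len(clean_scores) < 2:
        raise ValueError("need at least two paired observations")
    # One difference per dataset: leaky condition minus clean condition.
    diffs = [leaky - clean for clean, leaky in zip(clean_scores, leaky_scores)]
    spread = stdev(diffs)
    if spread == 0:
        raise ValueError("differences have zero variance; d_z is undefined")
    return mean(diffs) / spread


# Hypothetical per-dataset AUCs, for illustration only.
clean_auc = [0.80, 0.75, 0.90, 0.85]
leaky_auc = [0.82, 0.78, 0.91, 0.89]
d_z = paired_effect_size(clean_auc, leaky_auc)
print(f"d_z = {d_z:.2f}")
```

Because each dataset serves as its own control, d_z depends on how consistent the inflation is across datasets, not only on its average size.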

Abstract

Twenty-eight within-subject counterfactual experiments across 2,047 iid tabular datasets, plus a boundary experiment on 129 temporal datasets, measure the severity of four data leakage classes in machine learning. Class I (estimation: fitting scalers on full data) is negligible: all nine conditions produce |ΔAUC| ≤ 0.005. Class II (selection: peeking, seed cherry-picking) is substantial: the measured effect is consistent with about 90% noise exploitation inflating reported scores. Class III (memorization) scales with model capacity: d_z = 0.37 (Naive Bayes) to 1.11 (Decision Tree) at 10% duplication. Class IV (boundary) is invisible under random cross-validation. Within this iid tabular regime, the textbook emphasis is inverted: normalization leakage matters least; selection leakage at practical dataset sizes matters most.
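
To make the idea of selection leakage through seed cherry-picking concrete, here is a toy simulation. It is a sketch under stated assumptions, not the study's protocol: every seed is assumed to yield the same true AUC, test scores are modeled as Gaussian noise around that value, and all parameter values (`true_auc`, `noise_sd`, `n_seeds`) are hypothetical.

```python
import random
from statistics import mean


def simulate_seed_selection(n_seeds=20, noise_sd=0.02, true_auc=0.70,
                            n_trials=2000, rng_seed=0):
    """Compare an honest report with a cherry-picked one on noisy test scores.

    Every seed has the same underlying quality (true_auc), so any gap
    between the two reports comes from exploiting evaluation noise.
    """
    rng = random.Random(rng_seed)
    honest, cherry_picked = [], []
    for _ in range(n_trials):
        # Simulated test AUC for each random seed of the same model.
        scores = [rng.gauss(true_auc, noise_sd) for _ in range(n_seeds)]
        honest.append(scores[0])           # seed fixed before looking at results
        cherry_picked.append(max(scores))  # best seed picked using test scores
    return mean(honest), mean(cherry_picked)


honest_auc, cherry_auc = simulate_seed_selection()
inflation = cherry_auc - honest_auc
print(f"honest={honest_auc:.3f} cherry-picked={cherry_auc:.3f} "
      f"inflation={inflation:.3f}")
```

In this toy setting the reported gain is entirely noise, since no seed is truly better than another. The gap grows with the number of seeds tried and with the noise level, which tends to be larger on small test sets.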
