What question did this study set out to answer?

The aim is to assess how different types of label noise affect machine learning model performance and fairness.

March 21, 2026

Fault Lines: Benchmarking the Impact of Label Data Quality on ML Robustness and Fairness

Key Points

The aim is to assess how different types of label noise affect machine learning model performance and fairness.
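The authors introduce Fault Lines, a model-agnostic benchmark of 15 datasets systematically corrupted with diverse types of label noise.
They benchmark 22 state-of-the-art classification models, including gradient boosting, transformers, and fairness-oriented models.
An accompanying evaluation framework assesses robustness in both performance and fairness.
Many models maintain strong performance under high random noise (for example, Robust GBDT degrades only modestly at up to 40% noise) but are significantly less robust to even small amounts (<10%) of biased noise.
Biased noise can cause substantial performance drops: 7% noise reduces ResNet's AUC by 4.4% on average.
Transformer-based models appear more robust than boosting models under biased noise, though this advantage depends on tuning and comes with higher variance.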

Abstract

Artificial intelligence systems depend critically on high-quality data, yet real-world datasets are often imperfect. Label noise, such as incorrect or biased labels, can lead to suboptimal model decisions. While label noise has garnered increasing attention, existing research primarily examines random noise, employs simpler models, or relies on limited evaluation criteria. To address this, we introduce Fault Lines, a comprehensive, model-agnostic benchmark comprising 15 datasets systematically corrupted with diverse types of label noise, paired with an evaluation framework. This resource supports the evaluation of data cleaning pipelines and guides the design of models that are robust, in both performance and fairness, to label noise. We benchmark the robustness to label noise of 22 state-of-the-art classification models, including gradient boosting, transformers, and fairness-oriented models. Our findings show that many models maintain strong performance under high random noise (e.g., up to 40% noise leads to only a modest reduction in Robust GBDT performance). However, these models are significantly less robust to even small amounts of biased noise (<10%), which can cause substantial performance drops (e.g., 7% noise reduces ResNet's AUC by 4.4% on average) or maintain apparent stability at the expense of severe fairness degradation (e.g., MLP's Predictive Parity difference increases by 700% under 30% biased noise in the ACS Unemployment dataset). We investigate how different model architectures handle the impact of biased noise. Notably, transformer-based models appear more robust than boosting models when handling biased noise, though this advantage depends on tuning and comes with higher variance. Finally, we identify key factors for ML practitioners to mitigate the effects of label noise, including model selection, dataset analysis, and preprocessing.
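To make the distinction between random and biased label noise concrete, the following is a minimal sketch and not the benchmark's own code. It assumes binary labels and a binary group attribute. The function names (`flip_random`, `flip_biased`, `predictive_parity_difference`), the choice to flip only positive labels of one group, and the reading of the noise rate as a fraction of eligible rows are all illustrative assumptions. Predictive Parity difference is computed here as the absolute gap in precision between the two groups, which is a common definition and may differ in detail from the one used in the study.

```python
import numpy as np


def flip_random(y, rate, rng):
    """Random noise: flip each binary label independently with probability `rate`."""
    y = np.asarray(y).copy()
    mask = rng.random(y.shape[0]) < rate
    y[mask] = 1 - y[mask]
    return y


def flip_biased(y, group, rate, rng, target_group=1, from_label=1):
    """Biased noise (illustrative assumption): flip labels only for rows in
    `target_group` whose current label is `from_label`.

    `rate` is interpreted as the fraction of eligible rows to flip.
    """
    y = np.asarray(y).copy()
    group = np.asarray(group)
    eligible = np.flatnonzero((group == target_group) & (y == from_label))
    n_flip = int(round(rate * eligible.size))
    chosen = rng.choice(eligible, size=n_flip, replace=False)
    y[chosen] = 1 - from_label
    return y


def predictive_parity_difference(y_true, y_pred, group):
    """Absolute gap between groups 0 and 1 in precision, P(y = 1 | y_hat = 1).

    Returns NaN if either group has no positive predictions.
    """
    y_true, y_pred, group = (np.asarray(a) for a in (y_true, y_pred, group))
    precisions = []
    for g in (0, 1):
        predicted_pos = (group == g) & (y_pred == 1)
        if predicted_pos.sum() == 0:
            precisions.append(np.nan)
        else:
            precisions.append(y_true[predicted_pos].mean())
    return abs(precisions[0] - precisions[1])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y_clean = rng.integers(0, 2, size=1000)   # synthetic labels
    group = rng.integers(0, 2, size=1000)     # synthetic group attribute

    y_random = flip_random(y_clean, rate=0.4, rng=rng)
    y_biased = flip_biased(y_clean, group, rate=0.1, rng=rng)

    # Treat the noisy labels as stand-in "predictions" to see how each
    # noise type shifts the precision gap measured against clean labels.
    print("random:", predictive_parity_difference(y_clean, y_random, group))
    print("biased:", predictive_parity_difference(y_clean, y_biased, group))
```

Random flips affect both groups alike, whereas the biased variant changes labels for one group only. A corruption of that kind can leave aggregate accuracy looking stable while the gap between groups widens, which is the pattern the abstract reports for biased noise.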
