What question did this study set out to answer?

This research aims to investigate how data noise affects federated learning compared to centralized learning.

May 20, 2026

Understanding the Impact of Data Noise in Federated Learning: Experiments & Analysis

Key Points

The authors developed DataNoiseGenerator, an open-source, extensible toolkit for injecting controlled data noise across five data modalities: image, video, audio, text, and tabular.
They ran extensive experiments on the resulting noisy data to compare the quality of models trained with federated learning and with centralized learning.
They analyzed the root cause of federated learning's greater sensitivity to data noise, focusing on the effect of server-side aggregation.
Federated learning proved significantly more vulnerable to data noise than centralized learning, as measured by the quality of the trained models.
The gap between federated and centralized learning widened as noise intensity and the proportion of noisy clients increased.
Aggregation by the federated learning server can amplify divergent updates from clients trained on noisy data, hindering global model convergence.
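
The two experimental knobs named above, noise intensity and the share of noisy clients, can be pictured with a minimal sketch for tabular data. This is not the DataNoiseGenerator API, which this summary does not describe. The function names, the use of Gaussian noise, and the definition of intensity as a multiple of each column's standard deviation are all assumptions made for illustration.

```python
import numpy as np


def add_gaussian_noise(features, intensity, rng):
    """Return a copy of `features` with zero-mean Gaussian noise added.

    Assumption: `intensity` is the noise standard deviation expressed as a
    multiple of each column's own standard deviation.
    """
    features = np.asarray(features, dtype=float)
    scale = intensity * features.std(axis=0, keepdims=True)
    return features + rng.normal(0.0, 1.0, size=features.shape) * scale


def make_noisy_clients(client_data, noisy_fraction, intensity, seed=0):
    """Corrupt a randomly chosen fraction of clients and leave the rest clean.

    Returns the per-client arrays and the sorted indices of corrupted clients.
    """
    rng = np.random.default_rng(seed)
    n_clients = len(client_data)
    n_noisy = int(round(noisy_fraction * n_clients))
    noisy_ids = set(rng.choice(n_clients, size=n_noisy, replace=False).tolist())
    result = []
    for i, data in enumerate(client_data):
        if i in noisy_ids:
            result.append(add_gaussian_noise(data, intensity, rng))
        else:
            result.append(np.asarray(data, dtype=float).copy())
    return result, sorted(noisy_ids)


# Hypothetical setup: 4 clients, each holding 50 rows with 3 numeric columns.
data_rng = np.random.default_rng(1)
clients = [data_rng.normal(size=(50, 3)) for _ in range(4)]
noisy_clients, noisy_ids = make_noisy_clients(clients, noisy_fraction=0.5, intensity=1.0)
```

Sweeping `noisy_fraction` and `intensity` over a grid gives the kind of controlled comparison the key points describe, with the same corrupted data pooled for centralized training and kept per-client for federated training.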

Abstract

Federated learning (FL) has emerged as a popular paradigm for distributed machine learning over decentralized data. A typical FL training task involves a fleet of client devices with private data and a centralized server for aggregating the global model. Data generated by FL clients, e.g., smartphones, vehicles, and cameras, is prone to noise. While the impact of data noise on centralized learning (CL) is well understood, to the best of our knowledge there is no systematic study of its impact on FL. In this paper, we fill this gap by presenting an empirical investigation that provides a deeper understanding of the impact of data noise on FL. Our study is enabled by DataNoiseGenerator, an open-source and extensible toolkit that we developed for injecting controlled data noise across five diverse data modalities: image, video, audio, text, and tabular data. We then carry out extensive experiments on the noisy data generated by DataNoiseGenerator, and our experimental results reveal that FL is significantly more vulnerable to data noise than CL in terms of the quality of the trained ML models. This gap between FL and CL widens as the intensity of data noise and the proportion of noisy FL clients increase. We further present a detailed analysis to diagnose the root cause of this increased sensitivity of FL to data noise. Our analysis finds that the aggregation performed by the FL server can amplify divergent updates from FL clients trained on noisy data, thereby hindering global model convergence. We conclude that data quality issues are a fundamental challenge for deploying robust FL systems and demand novel decentralized data cleaning mechanisms.
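
To make the server-side aggregation step concrete, here is a minimal sketch assuming a FedAvg-style rule, in which the server takes a sample-size-weighted average of client parameters. The abstract does not say which aggregation rule the study uses, so the rule, the `fedavg` function name, and the toy numbers are assumptions. The example only shows how one divergent client shifts the averaged model. It does not reproduce the paper's amplification analysis.

```python
import numpy as np


def fedavg(client_weights, client_sizes):
    """Sample-size-weighted average of client parameter vectors (FedAvg-style)."""
    weights = np.stack([np.asarray(w, dtype=float) for w in client_weights])
    sizes = np.asarray(client_sizes, dtype=float)
    return (sizes[:, None] * weights).sum(axis=0) / sizes.sum()


# Toy parameters: three clients trained on clean data agree closely,
# while one client trained on noisy data has drifted far away.
clean_updates = [np.array([1.0, 1.0]), np.array([1.1, 0.9]), np.array([0.9, 1.1])]
divergent_update = np.array([-3.0, 5.0])

global_clean = fedavg(clean_updates, [100, 100, 100])
global_mixed = fedavg(clean_updates + [divergent_update], [100, 100, 100, 100])

# Distance the global model moved because of the single divergent client.
shift = np.linalg.norm(global_mixed - global_clean)
```

In this toy case the averaged parameters move from about [1, 1] to about [0, 2], so a single divergent client out of four displaces the global model by more than the spread among the clean clients. A centralized learner would instead see the noisy samples mixed into one training set.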
