Federated learning (FL) has emerged as a popular paradigm for distributed machine learning over decentralized data. A typical FL training task involves a fleet of client devices holding private data and a central server that aggregates their updates into a global model. Data generated by FL clients, e.g., smartphones, vehicles, and cameras, is prone to noise. While the impact of data noise on centralized learning (CL) is well understood, to the best of our knowledge it has not been studied systematically for FL. In this paper, we fill this gap with an empirical investigation of the impact of data noise on FL. Our study is enabled by DataNoiseGenerator, an open-source and extensible toolkit that we developed to inject controlled data noise across five diverse data modalities: image, video, audio, text, and tabular data. We carry out extensive experiments on the noisy data generated by DataNoiseGenerator, and the results reveal that FL is significantly more vulnerable to data noise than CL in terms of the quality of the trained ML models. This gap between FL and CL widens as the intensity of data noise and the proportion of noisy FL clients increase. We further present a detailed analysis to diagnose the root cause of this increased sensitivity. Our analysis finds that the aggregation performed by the FL server can amplify divergent updates from FL clients trained on noisy data, thereby hindering global model convergence. We conclude that data quality issues are a fundamental challenge for deploying robust FL systems and demand novel decentralized data cleaning mechanisms.
Hu et al. (Mon,) studied this question.
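The experimental setup described in the abstract (noise of a chosen intensity injected into a chosen proportion of clients, followed by server-side aggregation) can be illustrated with a minimal sketch. It is not the DataNoiseGenerator API, which the abstract does not describe. The function names, the choice of additive Gaussian noise, and the FedAvg-style weighted average are all assumptions made for illustration only.

```python
import random


def add_gaussian_noise(samples, intensity, rng):
    """Return a copy of `samples` (list of feature rows) with zero-mean
    Gaussian noise of standard deviation `intensity` added to every value.
    Hypothetical helper: Gaussian noise is an assumed noise model."""
    return [[x + rng.gauss(0.0, intensity) for x in row] for row in samples]


def select_noisy_clients(num_clients, noisy_fraction, rng):
    """Pick round(noisy_fraction * num_clients) client ids whose local
    data will be corrupted. Hypothetical helper."""
    k = round(noisy_fraction * num_clients)
    return set(rng.sample(range(num_clients), k))


def fedavg(client_weights, client_sizes):
    """Aggregate client parameter vectors by a data-size-weighted average.
    This is an assumed FedAvg-style rule; the abstract does not name the
    aggregation method used in the study."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    noisy_ids = select_noisy_clients(num_clients=10, noisy_fraction=0.3, rng=rng)
    # Toy local data: one 2-feature sample per client.
    local_data = [[[1.0, 2.0]] for _ in range(10)]
    for cid in noisy_ids:
        local_data[cid] = add_gaussian_noise(local_data[cid], intensity=0.5, rng=rng)
    # Stand-in for local training: use each client's sample as its "update".
    updates = [data[0] for data in local_data]
    print(fedavg(updates, client_sizes=[1] * 10))
```

In this toy setting the aggregate simply shifts in proportion to the weight of the noisy clients. The sketch only shows where the two experimental knobs (noise intensity and proportion of noisy clients) enter the pipeline; it does not reproduce the amplification effect the study analyzes.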