What question did this study set out to answer?

The aim is to review distributed machine learning systems, focusing on algorithms, communication efficiency, and convergence guarantees.

April 11, 2026 | Open Access

Distributed Machine Learning Systems: Algorithms, Communication Efficiency, and Convergence Guarantees

Key Points

Comprehensive survey of over 180 papers published between 2017 and 2025.
Analysis of distributed optimization algorithms like stochastic gradient descent and federated learning.
Evaluation of communication efficiency techniques including gradient compression and local SGD.
Identification of trade-offs between communication rounds, computation per round, and statistical accuracy.
Establishment of convergence guarantees under realistic scenarios like heterogeneous data and Byzantine failures.
Highlighting key open problems in distributed training regarding communication-computation trade-offs and differential privacy.

Abstract

The rapid growth of datasets and model sizes in modern machine learning has made distributed training not merely advantageous but essential. This survey provides a comprehensive review of distributed machine learning systems, with a focus on three interconnected aspects: (i) distributed optimization algorithms, including synchronous and asynchronous stochastic gradient descent, federated learning, and decentralized methods; (ii) communication efficiency techniques such as gradient compression, quantization, and local SGD; and (iii) convergence guarantees under realistic assumptions including heterogeneous data, partial participation, and Byzantine failures. We present a unified theoretical framework that relates communication complexity to convergence rates, identifying fundamental trade-offs between communication rounds, computation per round, and statistical accuracy. Our survey covers over 180 papers published between 2017 and 2025, with systematic comparisons on standard benchmarks. We identify key open problems including optimal communication-computation trade-offs, convergence under extreme heterogeneity, and the intersection of distributed training with differential privacy.
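
To make the trade-off between communication rounds and computation per round concrete, the following is a minimal sketch of local SGD with periodic model averaging. It is an illustration only and is not taken from the survey. The function `local_sgd`, the constant `WORKER_DATA`, and the 1-D least-squares objective are all hypothetical. The toy data is noise-free and every worker shares the same optimum (w = 2), so the sketch does not exhibit the drift that heterogeneous data can cause.

```python
import random

# Hypothetical toy data: two workers, each holding noise-free samples of y = 2 * x.
# The workers see different x ranges, but both share the same optimum w = 2.
WORKER_DATA = [
    [(1.0, 2.0), (1.2, 2.4), (1.4, 2.8)],
    [(1.6, 3.2), (1.8, 3.6), (2.0, 4.0)],
]


def local_sgd(worker_data, rounds, local_steps, lr=0.05, seed=0):
    """Local SGD with periodic averaging on the loss 0.5 * (w * x - y) ** 2.

    rounds      -- number of communication rounds (model averages)
    local_steps -- SGD steps each worker takes between two averages
    Returns the final averaged scalar model w.
    """
    rng = random.Random(seed)
    w_global = 0.0
    for _ in range(rounds):
        local_models = []
        for data in worker_data:
            w = w_global  # every worker starts from the last averaged model
            for _ in range(local_steps):
                x, y = rng.choice(data)   # sample one local data point
                grad = (w * x - y) * x    # gradient of the squared loss in w
                w -= lr * grad
            local_models.append(w)
        # One communication round: average the workers' local models.
        w_global = sum(local_models) / len(local_models)
    return w_global


if __name__ == "__main__":
    # Same total number of gradient steps per worker (200) in both runs,
    # but the second run communicates ten times less often.
    print(local_sgd(WORKER_DATA, rounds=200, local_steps=1))
    print(local_sgd(WORKER_DATA, rounds=20, local_steps=10))
```

Both configurations perform 200 gradient steps per worker, yet the second uses 20 averaging rounds instead of 200. On this easy problem both settle near w = 2. With noisy or genuinely heterogeneous data, taking more local steps per round would generally cost some accuracy, which is the kind of trade-off the abstract refers to.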
