The rapid growth of datasets and model sizes in modern machine learning has made distributed training not merely advantageous but essential. This survey provides a comprehensive review of distributed machine learning systems, with a focus on three interconnected aspects: (i) distributed optimization algorithms, including synchronous and asynchronous stochastic gradient descent, federated learning, and decentralized methods; (ii) communication efficiency techniques such as gradient compression, quantization, and local SGD; and (iii) convergence guarantees under realistic assumptions including heterogeneous data, partial participation, and Byzantine failures. We present a unified theoretical framework that relates communication complexity to convergence rates, identifying fundamental trade-offs between communication rounds, computation per round, and statistical accuracy. Our survey covers over 180 papers published between 2017 and 2025, with systematic comparisons on standard benchmarks. We identify key open problems including optimal communication-computation trade-offs, convergence under extreme heterogeneity, and the intersection of distributed training with differential privacy.
Ahmed Cherif (Thu,) studied this question.