Distributed learning is the standard for training large-scale models across private data silos, offering privacy and efficiency but facing challenges in Byzantine robustness and communication efficiency. Existing Byzantine-robust and communication-efficient methods rely on full gradient information, and they only converge to an unnecessarily large neighborhood around the solution. Motivated by these issues, we propose a novel Byzantine-robust and communication-efficient stochastic distributed learning method that imposes no requirements on batch size and converges to a smaller neighborhood, aligning with the theoretical lower bound. Our key innovation is leveraging Polyak Momentum to mitigate the noise caused by both biased compressors and stochastic gradients, thus defending against Byzantine workers under information compression. We provide proof of tight complexity bounds for nonconvex smooth loss functions. Finally, we validate the practical significance of our algorithm through an extensive series of experiments, benchmarking its performance on both binary classification and image classification tasks.
Liu et al. (Thu,) studied this question.