Learning the community structure of a large-scale graph is a fundamental problem in machine learning, computer science, and statistics. Among others, the Stochastic Block Model (SBM) serves as a canonical model for community detection and clustering, and the Massively Parallel Computation (MPC) model is a mathematical abstraction of real-world parallel computing systems which provides a powerful computational framework for handling large-scale datasets. We study the problem of exactly recovering the communities in a graph generated from the SBM in the MPC model. Specifically, given kn vertices that are partitioned into k equal-sized clusters (i.e., each has size n), a graph on these kn vertices is randomly generated such that each pair of vertices is connected with probability p if they are in the same cluster and with probability q if not, where p > q > 0. We give an MPC algorithm that recovers the ground-truth clusters when (p-q)/√p ≥˜Ω(k1/2n(-1/2+1/(2r-2))) for any integer r ∈ 3, O(log n) in O(kr/δ) rounds in the sublinear space MPC model, where each machine has local memory O(nδ) for some constant δ > 0. When (p-q)/√p≥ ˜Ω(k3/4n-1/4), we also give an MPC clustering algorithm that works in O(logs n) rounds in the s-space MPC model where each machine is only guaranteed to have memory s = Ω(log n). To implement the latter algorithm, we propose new algorithms for some basic graph operations in the s-space MPC model. Both algorithms significantly improve upon a recent result of Cohen-Addad et al. PODC'22, who gave an algorithm that only works in the sublinear space MPC model with a much stronger condition on p, q, k. Our algorithms are based on collecting the r-step neighborhood of each vertex and comparing the difference of some statistical information generated from the local neighborhoods for each pair of vertices.
Li et al. (Fri,) studied this question.