We show that a constant number of self-attention layers can efficiently simulate, and be simulated by, a constant number of communication rounds of Massively Parallel Computation. As a consequence, we show that logarithmic depth is sufficient for transformers to solve basic computational tasks that cannot be efficiently solved by several other neural sequence models and sub-quadratic transformer approximations. We thus establish parallelism as a key distinguishing property of transformers.
Sanford et al. studied this question.
www.synapsesocial.com/papers/68e792d4b6db643587704479 — DOI: https://doi.org/10.48550/arxiv.2402.09268
Clayton Sanford
Daniel Hsu
Matus Telgarsky