Attention-based architectures have become ubiquitous in machine learning, yet our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we show that their output can be decomposed into a sum of smaller terms, each involving the operation of a sequence of attention heads across layers. Using this decomposition, we prove that self-attention possesses a strong inductive bias towards "token uniformity". Specifically, without skip connections or multi-layer perceptrons (MLPs), the output converges doubly exponentially to a rank-1 matrix. On the other hand, skip connections and MLPs stop the output from degenerating. Our experiments verify the identified convergence phenomena on different variants of standard transformer architectures.
Dong et al. (Thu,) studied this question.
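The rank collapse described in the abstract can be illustrated with a small numerical sketch. This is not the authors' experimental setup. It assumes single-head attention with random Gaussian weights, toy sizes (8 tokens, width 16), and a Frobenius-norm residual measuring how far the output is from a matrix whose rows are all identical (a rank-1 matrix of the form 1xᵀ). The function names `attention_layer`, `relative_residual`, and `run` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention_layer(X, Wq, Wk, Wv):
    """Single-head self-attention with no skip connection and no MLP."""
    d = Wq.shape[1]
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
    return softmax(scores, axis=-1) @ (X @ Wv)


def relative_residual(X):
    """Relative Frobenius distance from X to the nearest matrix with identical rows.

    Assumption: the Frobenius norm is used here for simplicity; the nearest
    identical-row matrix under this norm repeats the mean row of X.
    """
    residual = X - X.mean(axis=0, keepdims=True)
    return np.linalg.norm(residual) / np.linalg.norm(X)


def run(depth=8, n_tokens=8, width=16, skip=False, seed=0):
    """Stack `depth` attention layers and record the residual after each one."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_tokens, width))
    history = [relative_residual(X)]
    for _ in range(depth):
        # Fresh random weights per layer, scaled to roughly preserve norms.
        Wq, Wk, Wv = (
            rng.standard_normal((width, width)) / np.sqrt(width) for _ in range(3)
        )
        Y = attention_layer(X, Wq, Wk, Wv)
        X = X + Y if skip else Y  # optionally add the skip connection
        history.append(relative_residual(X))
    return history


if __name__ == "__main__":
    for label, use_skip in (("pure attention", False), ("with skip", True)):
        values = run(skip=use_skip)
        print(label, " ".join(f"{v:.2e}" for v in values))
```

Under these assumptions, the pure-attention residual should shrink rapidly with depth, while the variant with skip connections should retain a non-negligible residual. The exact values depend on the seed and sizes chosen.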