March 4, 2021 | Open Access

Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth

Abstract

Attention-based architectures have become ubiquitous in machine learning, yet our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we show that their output can be decomposed into a sum of smaller terms, each involving the operation of a sequence of attention heads across layers. Using this decomposition, we prove that self-attention possesses a strong inductive bias towards "token uniformity". Specifically, without skip connections or multi-layer perceptrons (MLPs), the output converges doubly exponentially to a rank-1 matrix. On the other hand, skip connections and MLPs stop the output from degenerating. Our experiments verify the identified convergence phenomena on different variants of standard transformer architectures.
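
The rank-collapse claim can be illustrated with a small numerical sketch. This is not the authors' experimental setup. It assumes single-head attention, fresh random Gaussian weights per layer, no MLPs, and a Frobenius-norm proxy for the distance to a rank-1 matrix with identical rows (the input minus its mean token, relative to the input norm). The names `run`, `attention_layer`, and `relative_residual` are hypothetical helpers introduced only for this illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def frobenius(m):
    return math.sqrt(sum(v * v for row in m for v in row))


def relative_residual(x):
    """Assumed proxy: ||X - 1 x_bar^T||_F / ||X||_F, with x_bar the mean token."""
    n = len(x)
    mean = [sum(col) / n for col in zip(*x)]
    res = [[v - mu for v, mu in zip(row, mean)] for row in x]
    return frobenius(res) / frobenius(x)


def attention_layer(x, wq, wk, wv):
    """Single-head self-attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(x[0])
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, transpose(k))]
    return matmul(softmax_rows(scores), v)


def run(depth=8, n_tokens=8, dim=8, skip=False, seed=0):
    """Return the relative residual of the input and after each layer."""
    rng = random.Random(seed)

    def gauss(rows, cols, std):
        return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

    x = gauss(n_tokens, dim, 1.0)  # random token embeddings (assumption)
    history = [relative_residual(x)]
    for _ in range(depth):
        # Fresh random weights per layer, scaled by 1/sqrt(dim) (assumption).
        wq, wk, wv = (gauss(dim, dim, 1.0 / math.sqrt(dim)) for _ in range(3))
        out = attention_layer(x, wq, wk, wv)
        if skip:
            # Skip connection: add the layer input back to its output.
            x = [[a + b for a, b in zip(r_in, r_out)] for r_in, r_out in zip(x, out)]
        else:
            x = out
        history.append(relative_residual(x))
    return history


if __name__ == "__main__":
    for name, use_skip in (("pure attention", False), ("with skip connections", True)):
        print(name, ["%.2e" % r for r in run(skip=use_skip)])
```

Running the script prints the per-layer residual for both variants, so the pure-attention stack can be compared against the one with skip connections. Because both runs share a seed, they start from the same tokens and use the same weights. The only difference is the skip connection.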
