March 5, 2021 | Open Access

Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth

Abstract

Attention-based architectures have become ubiquitous in machine learning, yet our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we show that their output can be decomposed into a sum of smaller terms, each involving the operation of a sequence of attention heads across layers. Using this decomposition, we prove that self-attention possesses a strong inductive bias towards "token uniformity". Specifically, without skip connections or multi-layer perceptrons (MLPs), the output converges doubly exponentially to a rank-1 matrix. On the other hand, skip connections and MLPs stop the output from degeneration. Our experiments verify the identified convergence phenomena on different variants of standard transformer architectures.
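
The collapse described in the abstract can be illustrated with a small numerical sketch. This is not the authors' code or experimental setup: it assumes a single attention head per layer, random Gaussian weights with a hypothetical standard deviation of 0.2, no MLPs, and a Frobenius-norm relative residual as a simplified stand-in for the paper's residual measure. The helper names (`attention_layer`, `relative_residual`, `run`) are illustrative only.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Apply a numerically stable softmax to every row."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def random_matrix(rows, cols, std, rng):
    """Matrix with i.i.d. Gaussian entries (assumed initialization)."""
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

def attention_layer(x, wq, wk, wv):
    """Single-head self-attention: softmax(X Wq (X Wk)^T / sqrt(d)) X Wv."""
    d = len(wq[0])
    q, k = matmul(x, wq), matmul(x, wk)
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), matmul(x, wv))

def relative_residual(x):
    """||X - 1 mean^T||_F / ||X||_F: zero when all token rows are identical,
    i.e. when X has the rank-1 form 1 x^T."""
    n = len(x)
    means = [sum(col) / n for col in zip(*x)]
    res = sum((v - m) ** 2 for row in x for v, m in zip(row, means))
    tot = sum(v * v for row in x for v in row)
    return math.sqrt(res / tot)

def run(depth, skip, n_tokens=6, d_model=8, std=0.2, seed=0):
    """Stack `depth` attention layers, optionally with skip connections,
    and record the relative residual after each layer."""
    rng = random.Random(seed)
    x = random_matrix(n_tokens, d_model, 1.0, rng)
    history = [relative_residual(x)]
    for _ in range(depth):
        wq, wk, wv = (random_matrix(d_model, d_model, std, rng) for _ in range(3))
        out = attention_layer(x, wq, wk, wv)
        # With a skip connection the layer input is added back to its output.
        x = [[a + b for a, b in zip(r, o)] for r, o in zip(x, out)] if skip else out
        history.append(relative_residual(x))
    return history

pure = run(depth=6, skip=False)
with_skip = run(depth=6, skip=True)
for layer, (p, s) in enumerate(zip(pure, with_skip)):
    print(f"layer {layer}: pure={p:.3e}  skip={s:.3e}")
```

Both runs share the same seed, so they start from the same input and draw the same weights. Under these assumptions the pure-attention residual should shrink rapidly with depth, while the run with skip connections should keep a residual that does not vanish, which is consistent with the abstract's claim about token uniformity.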

Cite This Study

Dong et al. (2021) studied this question.

synapsesocial.com/papers/69f67343d85307304afc85d7 https://doi.org/10.48550/arxiv.2103.03404
