What question did this study set out to answer?

The study asks how decoder-only Transformer models perform retrieval, and traces it to three geometric mechanisms acting on embeddings: separation, projection, and amplification.

April 8, 2026 · Open Access

Separate, Project, and Amplify: Attention's Geometry of Retrieval

Key Points

Multi-Query Associative Recall is adapted into Tuple-Structured Associative Recall (TSAR) to decouple symbolic retrieval from positional retrieval.
Attention heads are analyzed to relate head dimension to retrieval accuracy.
A proof by construction, based on the trained models, establishes attention's single-head retrieval capacity.
Attention heads with dimensions down to the single digits can achieve perfect accuracy and strong length generalization on retrieval tasks.
The single-head retrieval capacity $N$ achieves or approaches the representational limit of its inputs for any head dimension $d_k \geq 2$.
Additional dimensions allow more variety in code geometry but do not meaningfully affect capacity.

Abstract

We show that decoder-only Transformer models perform retrieval by separating embeddings into dense spherical codes (sets of vectors with guaranteed angular separation), projecting that code, and amplifying it to saturate Softmax. We decouple symbolic retrieval from positional retrieval by adapting Multi-Query Associative Recall into Tuple-Structured Associative Recall (TSAR), and use it to demonstrate that attention heads down to single-digit dimensions can achieve perfect accuracy and strong length generalization on retrieval tasks. We prove by construction, based on the analysis of our trained models, that attention's single-head retrieval capacity $N$ achieves or approaches the representational limit of its inputs, with any head dimension $d_k \geq 2$. $N$ is thus unbounded with reals, and $N$ achieves or approaches $2^B$ with total bits $B$ distributed across the $d_k$ dimensions. Additional dimensions allow more variety in code geometry, but do not meaningfully impact capacity. The three mechanisms of retrieval lead to a number of predictions, including concerning implications for retrieval's training dynamics and the representational costs of positional encodings.
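As an illustration of the separate and amplify steps, the following minimal Python sketch places $N$ keys evenly on the unit circle (a spherical code in $d_k = 2$) and scales the attention logits by a gain before Softmax. It is not the paper's construction: the circle code, the `gain` parameter, and all function names are assumptions made for this example, and the projection step is skipped by defining the keys directly in two dimensions.

```python
import math


def circle_code(n):
    """Return n unit vectors evenly spaced on the circle.

    This is a simple spherical code in d_k = 2 (an assumption for this
    sketch): neighbouring vectors are separated by an angle of 2*pi/n.
    """
    return [
        (math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
        for i in range(n)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, gain):
    """Attention weights of one query over all keys.

    `gain` is a hypothetical scalar that multiplies the dot-product
    logits; it stands in for the amplification step.
    """
    scores = [gain * (query[0] * k[0] + query[1] * k[1]) for k in keys]
    return softmax(scores)


def min_target_weight(n, gain):
    """Smallest weight any query places on its own matching key."""
    keys = circle_code(n)
    return min(attend(keys[i], keys, gain)[i] for i in range(n))


if __name__ == "__main__":
    # Compare weak and strong amplification for 64 keys in 2 dimensions.
    for gain in (1.0, 100.0, 5000.0):
        print(gain, min_target_weight(64, gain))
```

With a small gain the attention weights stay close to uniform even though the matching key already has the largest logit. Raising the gain pushes nearly all of the weight onto that key, which is the sense in which amplification saturates Softmax. The gain needed grows as the angular separation shrinks, so packing more keys into the same two dimensions demands either larger magnitudes or more precision.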

Authors

Theodore Maselko
