We show that decoder-only Transformer models perform retrieval by separating embeddings into dense spherical codes (sets of vectors with guaranteed angular separation), projecting that code, and amplifying it to saturate Softmax. We decouple symbolic retrieval from positional retrieval by adapting Multi-Query Associative Recall into Tuple-Structured Associative Recall (TSAR), and use it to demonstrate that attention heads down to single-digit dimensions can achieve perfect accuracy and strong length generalization on retrieval tasks. We prove by construction, based on the analysis of our trained models, that attention's single-head retrieval capacity N achieves or approaches the representational limit of its inputs, with any head dimension dₖ ≥ 2. N is thus unbounded with reals, and N ≤ 2^B with total bits B distributed across dₖ. Additional dimensions allow more variety in code geometry, but do not meaningfully impact capacity. The three mechanisms of retrieval lead to a number of predictions, including concerning implications for retrieval's training dynamics and the representational costs of positional encodings.
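As a rough illustration of the separation and amplification steps described above, the following minimal sketch places N keys evenly on the unit circle (a simple spherical code for dₖ = 2) and scales the query-key dot products before Softmax. It is not the paper's construction or its trained models: the names `circle_code`, `attention_weights`, and `gain`, the evenly spaced code, and the choice N = 64 are all assumptions made for this example, and the projection step is taken to be the identity.

```python
import math

def circle_code(n):
    """Return n unit vectors evenly spaced on the circle (a toy spherical code, d_k = 2)."""
    return [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
            for i in range(n)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(keys, query, gain):
    """Softmax over gain-scaled dot products between one query and all keys."""
    scores = [gain * (k[0] * query[0] + k[1] * query[1]) for k in keys]
    return softmax(scores)

# Hypothetical setup: 64 keys in a 2-dimensional head, query equal to one key.
N = 64
keys = circle_code(N)
target = 5

# Low gain: the target has the highest score, but Softmax is far from saturated.
weak = attention_weights(keys, keys[target], gain=1.0)
# High gain: amplification pushes almost all attention mass onto the target.
strong = attention_weights(keys, keys[target], gain=2000.0)

print(f"weight on target, gain=1:    {weak[target]:.4f}")
print(f"weight on target, gain=2000: {strong[target]:.4f}")
```

In this toy setting the score gap between the target and its nearest neighbour is 1 − cos(2π/N), so the gain needed to saturate Softmax grows as the code becomes denser.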
Theodore Maselko
Theodore Maselko (Mon,) studied this question.
www.synapsesocial.com/papers/69d5f14b74eaea4b11a7ae86 — DOI: https://doi.org/10.5281/zenodo.19422845