The AI research community understands how transformer models work but not why they generate. Why does stacking layers produce qualitatively new capabilities? Why do emergent abilities appear at sharp thresholds rather than gradually? Why does in-context learning work when models were never explicitly designed for it? This paper proposes that the answer is structural. A companion paper (van der Klein, 2026d) derives a general principle: any self-similar cyclic process generates novelty because inner cycles irreversibly change the substrate on which outer cycles operate. This paper applies that principle to transformer models. Each layer applies four sequential operations (query, key, attention-weighted value, output projection) to the output of the previous layer, so each layer's processing changes the representation on which the next layer operates. The model generates because this recursive structure prevents it from merely retrieving. Three testable predictions follow: (1) layer ablation should show non-linear degradation, with middle layers contributing most; (2) information distance per layer should be strictly positive and super-additive across layers; and (3) deep, narrow models should generate more than shallow, wide models at matched parameter count.
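The layer structure described above can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not the paper's implementation: it uses a single attention head, random untrained weights, and no residual connections or normalization. The function names (`attention_layer`, `run_stack`, `make_layers`, `layer_distances`) are hypothetical, and the Euclidean norm of the change between consecutive layer outputs is only a stand-in for "information distance", whose definition is not given in this abstract.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_layer(x, w_q, w_k, w_v, w_o):
    """One simplified layer: query, key, attention-weighted value, output projection."""
    q = x @ w_q                                  # query
    k = x @ w_k                                  # key
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    mixed = weights @ (x @ w_v)                  # attention-weighted value
    return mixed @ w_o                           # output projection


def make_layers(n_layers, d_model, seed=0):
    # Random weights (an assumption); a trained model would supply its own.
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(d_model)
    return [
        tuple(rng.normal(0.0, scale, (d_model, d_model)) for _ in range(4))
        for _ in range(n_layers)
    ]


def run_stack(x, layers):
    # Each layer operates on the output of the previous layer.
    states = [x]
    for params in layers:
        states.append(attention_layer(states[-1], *params))
    return states


def layer_distances(states):
    # Assumed proxy: Euclidean norm of the change introduced by each layer.
    return [float(np.linalg.norm(b - a)) for a, b in zip(states[:-1], states[1:])]


tokens = np.random.default_rng(1).normal(size=(5, 8))   # 5 tokens, width 8
layers = make_layers(n_layers=4, d_model=8)
states = run_stack(tokens, layers)
distances = layer_distances(states)
print(distances)
```

On a trained model, the same loop could be run over real hidden states, with the proxy replaced by whatever measure the paper defines, to examine prediction (2). Skipping one entry of `layers` at a time and comparing the final state gives a starting point for the ablation in prediction (1).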
Raimo van der Klein studied this question.