This paper presents a unified geometric framework for understanding the working mechanisms of large language models. The core thesis is: language is not a sequence of positions, but a sequence of transformations. We reinterpret the word embedding space as fibers of a principal bundle, attention mechanisms as frame transformations, and language generation as probabilistic path sampling of difference vectors. Seven tests (synthetic data, GloVe, BERT) validate the core propositions: difference vectors have low-dimensional structure (2-5 dimensions), different syntactic relations correspond to different subspaces, local sections can glue under compatibility conditions, and non-linearity emerges from twisted gluing.

This paper reveals the underlying computational pattern of language models: linear computation → projection onto curve (non-linearity) → interpretation as probability. Linear space is the locus of “computation” (word vector dot products, attention scores, linear layers); curve space is the locus of “understanding” (probability, confidence, decisions). Sigmoid and Softmax are converters between linearity and non-linearity, transforming the infinite into the finite and the uniform into the non-uniform. Non-linear activations serve as “scale regulators” in horizontal gluing and vertical coarsening, enabling information from different sources and scales to be processed uniformly.

This paper reveals that the essence of the attention mechanism $QK^{\top}/\sqrt{d_k}$ is: measuring relevance through relative angles, projecting from linear space to curve space through normalization, and finally extracting information through probabilistic weighting.

Tracing historical roots, this paper shows how Weaver’s (1949) sliding-window observation and the Harris/Firth (1954/1957) distributional hypothesis receive geometric mechanisms in this framework.
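The three-step reading of attention given above (linear scores, normalization, probabilistic weighting) can be sketched in a few lines of NumPy. This is a minimal illustration of standard scaled dot-product attention, not code from the paper; the function names, shapes, and random inputs are assumptions chosen for the example. Note that a raw dot product depends on vector norms as well as on the angle, so it equals the cosine of the relative angle only when the query and key vectors are normalized.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row maximum for numerical stability; the result is unchanged.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    # Step 1 (linear): pairwise dot products, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # Step 2 (non-linear): softmax maps unbounded scores to probabilities.
    weights = softmax(scores, axis=-1)
    # Step 3 (probabilistic weighting): each output row is a weighted mean of V.
    return weights @ V, weights

# Hypothetical toy inputs: 3 queries, 5 keys/values, d_k = 4.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 2))
output, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` is non-negative and sums to one, which is what allows the scores to be read as a probability distribution over keys.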
We prove that backpropagation learning of word embeddings is equivalent to iterative eigendecomposition of the co-occurrence matrix: eigendecomposition is global projection; backpropagation is local projection. Each layer continues decomposition on the residual of the previous layer, progressively refining semantic features.

We distinguish two types of probability, average probability (distributional hypothesis) for word embeddings (“who co-occurs with whom”) and specific probability (relational theory) for attention mechanisms (“who precedes whom”), and show that induction and inference are two directions of the same probabilistic mechanism.

We introduce Ext/Tor as algebraic quantifiers of gluing obstructions, establish the melody-harmony analogy, interpret prompts as human-provided inductive rules, and explore the possibility and boundaries of artificial instincts. Global non-linearity is the result of local linear regions being glued after scale adjustment through curve projection; this is precisely the geometric meaning of horizontal gluing (disjoint union + intersection). Non-linear activation does not “destroy” linearity, but performs scale transformation on top of linearity: approximately linear in the middle region, saturating at both ends, unifying scales in gluing and coarsening.

A suite of visualization tools is developed, offering a geometric perspective for understanding, debugging, and validating large language models.
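One way to picture the idea of continuing the decomposition on the residual of the previous step is power iteration with deflation. The sketch below uses that standard numerical technique as a stand-in for illustration only; it is not the paper's proof or its training procedure. The matrix `C` is a hypothetical symmetric co-occurrence table, and the function names are assumptions.

```python
import numpy as np

def top_eigenpair(M, iters=500, seed=0):
    # Power iteration: repeated local updates v <- Mv / ||Mv||.
    rng = np.random.default_rng(seed)
    v = rng.normal(size=M.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = M @ v
        norm = np.linalg.norm(w)
        if norm == 0.0:
            break
        v = w / norm
    # The Rayleigh quotient gives the eigenvalue estimate for unit-norm v.
    return v @ M @ v, v

def deflated_eigenpairs(M, k):
    # Extract k components, each computed on the residual left by the last.
    residual = M.astype(float).copy()
    pairs = []
    for _ in range(k):
        lam, v = top_eigenpair(residual)
        pairs.append((lam, v))
        residual = residual - lam * np.outer(v, v)  # remove the found component
    return pairs

# Hypothetical symmetric co-occurrence counts for three words.
C = np.array([[4.0, 2.0, 0.0],
              [2.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
pairs = deflated_eigenpairs(C, k=2)
iterative = [lam for lam, _ in pairs]
# Reference values from a global eigendecomposition, largest first.
reference = np.sort(np.linalg.eigvalsh(C))[::-1][:2]
```

For this toy matrix the iterative estimates agree with the global decomposition, which mirrors the global versus local contrast drawn above. Power iteration converges to the eigenvalue of largest magnitude, so the sketch assumes a symmetric matrix whose leading eigenvalues are positive and well separated.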
Xiaobo Li (Fri,) studied this question.