Large language models (LLMs) built on the Transformer architecture have achieved remarkable performance in natural language understanding, text generation, and logical reasoning, yet their internal working mechanisms remain poorly understood. This paper establishes a systematic mathematical analysis framework for decoder-only Transformer LLMs, drawing on linear algebra, tensor analysis, probability theory, information theory, optimization dynamics, and geometric deep learning. We model and analyze the core modules, namely word embedding, position encoding, self-attention, feed-forward networks, training optimization, and generalization and reasoning, and explore the mathematical nature of semantic representation, contextual correlation, knowledge storage, and logical inference within these models. Throughout, we strictly distinguish established Transformer theory from our original mathematical derivations and conclusions.
In contrast to existing fragmented theoretical studies, this work makes six contributions beyond conventional Transformer theory: (1) we construct the first unified mathematical framework covering all core modules and the full lifecycle of Transformer-based LLMs; (2) we prove that single-head self-attention is a kernel-weighted average in a reproducing kernel Hilbert space, and derive the low-rank and sparse properties of the attention weights; (3) we establish a high-dimensional non-convex optimization dynamics model for pre-training and prove that training converges to flat local minima; (4) we derive a tighter upper bound on the generalization error and quantify how the number of model parameters, sequence length, and training data scale relate to generalization performance; (5) we characterize the latent space as a smooth, low-curvature Riemannian manifold and model logical reasoning as a geometric transformation on this manifold; (6) we design multiple groups of controlled experiments on mainstream datasets to validate the above theoretical conclusions quantitatively. We further summarize the inherent mathematical limitations of current Transformer LLMs and propose feasible theoretical directions for improvement, with reference to research published from 2021 to 2026. These results provide mathematical support for improving model interpretability, optimizing network structures, and boosting practical performance, and help move LLM research from empirical engineering practice toward theory-driven development.
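As a point of reference for contribution (2), the standard softmax attention output can already be written as a normalized kernel-weighted average. The display below is a sketch in that standard form, not the derivation of this paper; the symbols $q_i$, $k_j$, $v_j$ (query, key, and value vectors), the head dimension $d$, and the kernel name $\kappa$ are notation assumed here for illustration, with the causal restriction $j \le i$ reflecting the decoder-only setting.

$$
\mathrm{Attn}(q_i) \;=\; \sum_{j \le i} \frac{\kappa(q_i, k_j)}{\sum_{l \le i} \kappa(q_i, k_l)}\, v_j ,
\qquad
\kappa(q, k) \;=\; \exp\!\left(\frac{q^{\top} k}{\sqrt{d}}\right).
$$

The weights are nonnegative and sum to one over $j \le i$, so each output is a convex combination of the value vectors. Viewed as a function on $\mathbb{R}^d \times \mathbb{R}^d$, the exponential kernel $\kappa$ is symmetric and positive definite, which is what permits a reproducing kernel Hilbert space reading of this average.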
Guo et al. (Mon,) studied this question.