Large Language Models (LLMs) exhibit the remarkable ability to learn new tasks from a few demonstration examples without any weight updates, a phenomenon known as In-Context Learning (ICL). While empirical work suggests that ICL behaves similarly to explicit fine-tuning, a rigorous mathematical framework formalizing this equivalence for standard softmax attention remains a critical open problem. In this paper, we provide a formal proof that the forward pass of a Transformer attention layer mathematically executes a step of gradient descent on a meta-learned regression objective. We rigorously bound the Taylor expansion remainder and extend this equivalence to provide non-asymptotic sample complexity bounds for sub-Gaussian distributions. Our results explicitly decompose the excess risk into bias and variance components, proving that the implicit gradient step strictly reduces generalization error on the in-context task at a rate of O(1/N), thus providing a solid theoretical foundation for the scalability of ICL architectures.
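To make the claimed equivalence concrete, the following is a minimal sketch of the simplified linear-attention case, not the softmax construction or the proof described above. It assumes a least-squares objective L(w) = 1/(2N) Σ (w·x_i − y_i)², a zero initialization for w, and unnormalized attention with keys x_i, values y_i, and the query as the test input. The function names, the step size `lr`, and the synthetic data are all hypothetical choices made for illustration.

```python
import random

def gd_step_prediction(xs, ys, x_query, lr):
    """Predict with the weights obtained after one gradient-descent step
    from w = 0 on L(w) = 1/(2N) * sum_i (w . x_i - y_i)^2."""
    n, d = len(xs), len(x_query)
    # The gradient at w = 0 is -(1/N) * sum_i y_i * x_i,
    # so the updated weights are w_1 = (lr/N) * sum_i y_i * x_i.
    w = [lr / n * sum(y * x[j] for x, y in zip(xs, ys)) for j in range(d)]
    return sum(wj * qj for wj, qj in zip(w, x_query))

def linear_attention_prediction(xs, ys, x_query, lr):
    """Unnormalized linear attention (an assumed simplification of softmax):
    keys are x_i, values are y_i, and the query is x_query."""
    n = len(xs)
    # Attention scores are plain inner products between each key and the query.
    scores = [sum(a * b for a, b in zip(x, x_query)) for x in xs]
    return lr / n * sum(s * y for s, y in zip(scores, ys))

# Synthetic in-context regression task (illustrative data only).
random.seed(0)
d, n = 3, 8
w_true = [random.gauss(0, 1) for _ in range(d)]
xs = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
ys = [sum(w * xi for w, xi in zip(w_true, x)) for x in xs]
x_query = [random.gauss(0, 1) for _ in range(d)]

pred_gd = gd_step_prediction(xs, ys, x_query, lr=0.1)
pred_attn = linear_attention_prediction(xs, ys, x_query, lr=0.1)
print(abs(pred_gd - pred_attn) < 1e-12)
```

In this linear setting the two predictions agree up to floating-point rounding, because both reduce to (lr/N) Σ y_i (x_i · x_query). Softmax attention replaces the inner product with an exponential, and the inner product reappears as the first-order term of that exponential's Taylor expansion, which is presumably why a bound on the remainder is needed.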
Sittiphol Phanvilai (Sun,) studied this question.