What question did this study set out to answer?

This research aims to formalize the equivalence between in-context learning and gradient descent in transformer models.

June 23, 2026 · Open Access

Formalizing In-Context Learning in Transformers as Implicit Gradient Descent

Key Points

Provided a formal proof that the forward pass of a Transformer attention layer executes a step of gradient descent on a meta-learned regression objective.
Rigorously bounded the Taylor expansion remainder.
Extended the equivalence to non-asymptotic sample complexity bounds for sub-Gaussian distributions.
Proved that the implicit gradient step strictly reduces generalization error on the in-context task at a rate of O(1/N).
Explicitly decomposed excess risk into bias and variance components.
Established a theoretical foundation for the scalability of in-context learning architectures.

Abstract

Large Language Models (LLMs) exhibit the remarkable ability to learn new tasks from a few demonstration examples without any weight updates, a phenomenon known as In-Context Learning (ICL). While empirical work suggests that ICL behaves similarly to explicit fine-tuning, a rigorous mathematical framework formalizing this equivalence for standard softmax attention remains a critical open problem. In this paper, we provide a formal proof that the forward pass of a Transformer attention layer mathematically executes a step of gradient descent on a meta-learned regression objective. We rigorously bound the Taylor expansion remainder and extend this equivalence to provide non-asymptotic sample complexity bounds for sub-Gaussian distributions. Our results explicitly decompose the excess risk into bias and variance components, proving that the implicit gradient step strictly reduces generalization error on the in-context task at a rate of O(1/N), thus providing a solid theoretical foundation for the scalability of ICL architectures.
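The core claim can be illustrated numerically in a simplified setting. The sketch below is not the paper's construction: it assumes a linear regression task, a squared loss, a zero weight initialization, and unnormalized linear attention (keys are the inputs, values are the labels), for which the match with one gradient step is exact. The second part assumes a temperature-scaled softmax and compares it with its first-order Taylor expansion, to show why a bound on the remainder is needed in the softmax case. All function names, the learning rate, and the temperature parameter are hypothetical choices made for illustration.

```python
import numpy as np

def gd_step_prediction(X, y, x_query, lr):
    """Prediction after one gradient step from w = 0 on
    L(w) = 1/(2N) * sum_i (w @ x_i - y_i)**2 (assumed objective)."""
    n = X.shape[0]
    grad_at_zero = -(X.T @ y) / n        # gradient of L evaluated at w = 0
    w_one_step = -lr * grad_at_zero      # w_1 = 0 - lr * grad
    return float(w_one_step @ x_query)

def linear_attention_prediction(X, y, x_query, lr):
    """Unnormalized linear attention: keys x_i, values y_i, query x_query."""
    n = X.shape[0]
    scores = X @ x_query                 # inner products <x_i, x_query>
    return float(lr / n * (scores @ y))

def softmax_attention(X, y, x_query, temperature):
    """Softmax attention over the demonstrations with scaled logits."""
    logits = (X @ x_query) / temperature
    weights = np.exp(logits - logits.max())   # subtract max for stability
    weights /= weights.sum()
    return float(weights @ y)

def first_order_softmax(X, y, x_query, temperature):
    """First-order Taylor expansion of the softmax output in 1/temperature:
    mean(y) + (1 / (N * temperature)) * sum_i (s_i - mean(s)) * y_i."""
    n = X.shape[0]
    scores = X @ x_query
    return float(y.mean() + ((scores - scores.mean()) @ y) / (n * temperature))

# Synthetic in-context regression task (illustrative data only).
rng = np.random.default_rng(0)
n, d = 32, 4
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)
x_query = rng.normal(size=d)

# Linear attention reproduces the one-step gradient descent prediction.
gap_linear = abs(
    gd_step_prediction(X, y, x_query, lr=0.1)
    - linear_attention_prediction(X, y, x_query, lr=0.1)
)

# Softmax only matches its linear expansion up to a remainder that
# shrinks as the logits get smaller (larger temperature).
def remainder(temperature):
    return abs(
        softmax_attention(X, y, x_query, temperature)
        - first_order_softmax(X, y, x_query, temperature)
    )

remainder_t10 = remainder(10.0)
remainder_t100 = remainder(100.0)

print("linear attention vs. GD gap:", gap_linear)
print("softmax remainder at temperature 10: ", remainder_t10)
print("softmax remainder at temperature 100:", remainder_t100)
```

In this toy setting the linear-attention gap is at floating-point level, while the softmax remainder is nonzero and decreases as the temperature grows. Controlling that remainder for standard softmax attention is the part the abstract describes as rigorously bounded; the sketch does not attempt to reproduce that bound or the O(1/N) rate.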
