March 18, 2024 | Open Access

Revisiting the Equivalence of In-Context Learning and Gradient Descent: The Impact of Data Distribution

Abstract

Transformers exhibit in-context learning (ICL), enabling adaptation to various tasks via prompts without the need for computationally intensive fine-tuning. Recent research investigates ICL's mechanisms under analytically tractable models, with some conjecturing that ICL with linear attention implements one step of gradient descent for simple linear regression tasks. This paper reevaluates this claim and shows that it relies on strong assumptions such as feature independence. Relaxing these assumptions, we prove that ICL with linear attention resembles preconditioned gradient descent, with a preconditioner that depends on the data covariance. Our experiments support this finding. We also empirically explore softmax attention and find that increasing the number of attention heads yields a closer approximation to gradient descent. Our work offers a nuanced perspective on the connection between ICL and gradient descent, emphasizing the role of data assumptions.
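As a rough illustration of the correspondence described in the abstract, the following NumPy sketch compares one step of gradient descent from a zero initialization on a least-squares objective with an unnormalized linear-attention readout over the same in-context examples. It is not the paper's construction: the function names, the learning rate `lr`, the synthetic data, and in particular the choice of the inverse feature covariance as the preconditioner `P` are assumptions made for illustration only. The abstract states only that the preconditioner depends on the data covariance.

```python
import numpy as np

def gd_one_step_prediction(X, y, x_query, lr=1.0, P=None):
    """Prediction after one (optionally preconditioned) GD step from w = 0.

    Loss: L(w) = 1/(2n) * sum_i (w @ x_i - y_i)^2, so grad L(0) = -(X.T @ y) / n.
    """
    n, d = X.shape
    if P is None:
        P = np.eye(d)  # plain gradient descent
    grad_at_zero = -(X.T @ y) / n
    w1 = -lr * (P @ grad_at_zero)
    return float(w1 @ x_query)

def linear_attention_prediction(X, y, x_query, lr=1.0, P=None):
    """Unnormalized linear attention: keys x_i, values y_i, query P @ x_query."""
    n, d = X.shape
    if P is None:
        P = np.eye(d)
    scores = X @ (P @ x_query)  # one key-query score per context example
    return float(lr * (y @ scores) / n)

# Synthetic linear regression task with correlated features (assumed setup).
rng = np.random.default_rng(0)
d, n = 4, 64
A = rng.normal(size=(d, d))
Sigma = A @ A.T + np.eye(d)  # non-identity feature covariance
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
w_true = rng.normal(size=d)
y = X @ w_true
x_query = rng.multivariate_normal(np.zeros(d), Sigma)

# Plain GD step versus plain linear attention.
plain_gd = gd_one_step_prediction(X, y, x_query)
plain_attn = linear_attention_prediction(X, y, x_query)

# Preconditioned variant; P = inverse covariance is an illustrative assumption.
P = np.linalg.inv(Sigma)
pre_gd = gd_one_step_prediction(X, y, x_query, P=P)
pre_attn = linear_attention_prediction(X, y, x_query, P=P)

print("plain:          GD =", plain_gd, " attention =", plain_attn)
print("preconditioned: GD =", pre_gd, " attention =", pre_attn)
print("target:", float(w_true @ x_query))
```

In both cases the two functions compute the same quantity, (lr / n) * sum_i y_i * x_i^T P x_query, which is why the gradient step and the attention readout agree. The sketch fixes `P` by hand rather than learning attention weights, so it shows only the algebraic identity, not what a trained transformer converges to.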
