Transformers exhibit in-context learning (ICL), which lets them adapt to various tasks via prompts without computationally intensive fine-tuning. Recent research investigates the mechanisms of ICL in analytically tractable models, and some works conjecture that ICL with linear attention implements one step of gradient descent on simple linear regression tasks. This paper reevaluates that claim and shows that it relies on strong assumptions such as feature independence. Relaxing these assumptions, we prove that ICL with linear attention resembles preconditioned gradient descent, with a preconditioner that depends on the data covariance. Our experiments support this finding. We also empirically explore softmax attention and find that increasing the number of attention heads yields a closer approximation to gradient descent. Our work offers a nuanced perspective on the connection between ICL and gradient descent, emphasizing the role of data assumptions.
Mahdavi et al. studied this question.
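The contrast the abstract draws between plain and preconditioned gradient descent can be illustrated with a minimal NumPy sketch. It is not the paper's construction. The equicorrelated covariance `Sigma`, the step size `eta=1.0`, the zero initialization, and the choice of the inverse covariance as preconditioner are all assumptions made for illustration. The abstract only states that the preconditioner depends on the data covariance. The helper name `one_step_gd` is hypothetical.

```python
import numpy as np

def one_step_gd(X, y, eta=1.0, P=None):
    """One (optionally preconditioned) gradient step from w = 0 on the
    least-squares loss L(w) = (1 / 2n) * ||X w - y||^2."""
    n, d = X.shape
    grad_at_zero = -(X.T @ y) / n      # gradient of L evaluated at w = 0
    if P is None:
        P = np.eye(d)                  # plain gradient descent
    return -eta * P @ grad_at_zero     # w_1 = w_0 - eta * P * grad, with w_0 = 0

rng = np.random.default_rng(0)
d, n, rho = 5, 20000, 0.8

# Assumed correlated (non-independent) features: equicorrelation covariance.
Sigma = (1 - rho) * np.eye(d) + rho * np.ones((d, d))
L = np.linalg.cholesky(Sigma)
X = rng.normal(size=(n, d)) @ L.T      # rows have covariance Sigma
w_star = rng.normal(size=d)            # ground-truth regression weights
y = X @ w_star                         # noiseless linear regression targets

w_gd = one_step_gd(X, y)                            # one plain GD step
w_pgd = one_step_gd(X, y, P=np.linalg.inv(Sigma))   # assumed preconditioner

err_gd = np.linalg.norm(w_gd - w_star)
err_pgd = np.linalg.norm(w_pgd - w_star)
print(f"plain GD error:          {err_gd:.3f}")
print(f"preconditioned GD error: {err_pgd:.3f}")
```

Under these assumptions the plain step returns roughly `Sigma @ w_star`, which differs from `w_star` whenever the features are correlated, while the preconditioned step undoes the covariance and lands much closer to `w_star`. With independent, unit-variance features the two steps coincide.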