Recent research has sought to understand Transformers through the lens of in-context learning with functional data. We extend that line of work with the goal of moving closer to language models, considering categorical outcomes, nonlinear underlying models, and nonlinear attention. The contextual data are of the form $\mathsf{C} = (x_1, c_1, \dots, x_N, c_N)$, where each $c_i \in \{0, \dots, C-1\}$ is drawn from a categorical distribution that depends on covariates $x_i \in \mathbb{R}^d$. Contextual outcomes in the $m$th set of contextual data, $\mathsf{C}_m$, are modeled in terms of latent function $f_m(x) \in \mathsf{F}$, where $\mathsf{F}$ is a functional class with $(C-1)$-dimensional vector output. The probability of observing class $c \in \{0, \dots, C-1\}$ is modeled in terms of the output components of $f_m(x)$ via the softmax. The Transformer parameters may be trained with $M$ contextual examples, $\{\mathsf{C}_m\}_{m=1,M}$, and the trained model is then applied to new contextual data $\mathsf{C}_{M+1}$ for new $f_{M+1}(x) \in \mathsf{F}$. The goal is for the Transformer to constitute the probability of each category $c \in \{0, \dots, C-1\}$ for a new query $x_{N_{M+1}+1}$. We assume each component of $f_m(x)$ resides in a reproducing kernel Hilbert space (RKHS), specifying $\mathsf{F}$. Analysis and an extensive set of experiments suggest that on its forward pass the Transformer (with attention defined by the RKHS kernel) implements a form of gradient descent of the underlying function, connected to the latent vector function associated with the softmax. We present what is believed to be the first real-world demonstration of this few-shot-learning methodology, using the ImageNet dataset.
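The gradient-descent view described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it is a minimal example of functional gradient descent on the softmax cross-entropy loss in an RKHS, in which each update has an attention-like form (kernel weights between query and context covariates, applied to residuals between one-hot labels and current class probabilities). The function names `class_probs`, `rbf_kernel`, and `functional_gd` are hypothetical, and several choices are assumptions made only for illustration: an RBF kernel, class 0 treated as the reference class with its logit fixed at zero, a zero initial function, and the step size and number of steps.

```python
import numpy as np

def class_probs(f):
    """Map (C-1)-dimensional latent outputs to C class probabilities.

    Assumption: class 0 is the reference class with logit fixed at 0.
    """
    logits = np.concatenate([np.zeros(f.shape[:-1] + (1,)), f], axis=-1)
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

def rbf_kernel(a, b, lengthscale=1.0):
    """RBF kernel matrix between rows of a and rows of b (assumed kernel)."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale**2)

def functional_gd(x_ctx, c_ctx, x_query, num_classes, steps=50, lr=0.5):
    """Run functional gradient descent on the context cross-entropy loss.

    Each step adds (lr / N) * sum_i k(x, x_i) * residual_i to f(x), where
    residual_i = onehot(c_i) - p(f(x_i)), restricted to the C-1 free logits.
    """
    n = x_ctx.shape[0]
    onehot = np.eye(num_classes)[c_ctx]
    k_cc = rbf_kernel(x_ctx, x_ctx)    # context-to-context kernel weights
    k_qc = rbf_kernel(x_query, x_ctx)  # query-to-context kernel weights
    f_ctx = np.zeros((n, num_classes - 1))                 # f at context points
    f_query = np.zeros((x_query.shape[0], num_classes - 1))  # f at queries
    for _ in range(steps):
        resid = (onehot - class_probs(f_ctx))[:, 1:]  # drop reference class
        f_query = f_query + (lr / n) * k_qc @ resid
        f_ctx = f_ctx + (lr / n) * k_cc @ resid
    return class_probs(f_query)
```

Calling `functional_gd(x_ctx, c_ctx, x_query, num_classes=C)` with context covariates of shape `(N, d)`, integer labels of shape `(N,)`, and query covariates of shape `(Q, d)` returns a `(Q, C)` array of class probabilities for the queries. In this sketch each loop iteration plays the role that a single attention layer plays in the abstract's interpretation, though the correspondence here is only schematic.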
A. Wang
Ricardo Henao
Lawrence Carin
Wang et al. studied this question.
www.synapsesocial.com/papers/68e68593b6db64358760ddc8 — DOI: https://doi.org/10.48550/arxiv.2405.17248