GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

January 1, 2023 | Open Access

Joshua Ainslie (University of Southern California), James Lee-Thorp (Google, United States), Michiel de Jong (University of Southern California)

Abstract

Multi-query attention (MQA), which only uses a single key-value head, drastically speeds up decoder inference. However, MQA can lead to quality degradation, and moreover it may not be desirable to train a separate model just for faster inference. We (1) propose a recipe for uptraining existing multi-head language model checkpoints into models with MQA using 5% of original pre-training compute, and (2) introduce grouped-query attention (GQA), a generalization of multi-query attention which uses an intermediate (more than one, less than number of query heads) number of key-value heads. We show that uptrained GQA achieves quality close to multi-head attention with comparable speed to MQA.
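
The grouping idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `grouped_query_attention`, the tensor layout, and the choice to assign consecutive query heads to each key-value head are assumptions made for illustration, and causal masking and the learned projections are left out. With one key-value head the sketch reduces to MQA; with as many key-value heads as query heads it reduces to ordinary multi-head attention.

```python
import numpy as np


def grouped_query_attention(q, k, v):
    """Attention in which groups of query heads share one key-value head.

    q:    array of shape (num_q_heads, seq_len, head_dim)
    k, v: arrays of shape (num_kv_heads, seq_len, head_dim)
    num_q_heads must be a multiple of num_kv_heads.
    """
    num_q_heads, _, head_dim = q.shape
    num_kv_heads = k.shape[0]
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads

    # Assumed grouping: query heads [g * group_size, (g + 1) * group_size)
    # all read from key-value head g.
    k_shared = np.repeat(k, group_size, axis=0)
    v_shared = np.repeat(v, group_size, axis=0)

    # Scaled dot-product scores, shape (num_q_heads, seq_len, seq_len).
    scores = q @ k_shared.transpose(0, 2, 1) / np.sqrt(head_dim)

    # Numerically stable softmax over the key positions.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return weights @ v_shared
```

For example, calling the function with `q` of shape `(8, 5, 4)` and `k`, `v` of shape `(2, 5, 4)` gives eight output heads while storing only two key-value heads, which is where the inference-time memory saving would come from.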

Cite This Study

Ainslie et al. (2023) studied this question.

synapsesocial.com/papers/6a0331f975054b3fdf9e262b https://doi.org/10.18653/v1/2023.emnlp-main.298