April 29, 2024 | Open Access

Accelerating Production LLMs with Combined Token/Embedding Speculators

Abstract

This technical report describes the design and training of novel speculative decoding draft models for accelerating large language model inference in a production environment. By conditioning draft predictions on both context vectors and sampled tokens, we can train our speculators to efficiently predict high-quality n-grams, which the base model then accepts or rejects. This allows us to effectively predict multiple tokens per inference forward pass, accelerating wall-clock inference speeds of highly optimized base model implementations by a factor of 2-3. We explore these initial results and describe next steps for further improvements.
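The accept-or-reject step mentioned in the abstract can be illustrated with a minimal sketch. It assumes greedy decoding and exact-match verification, which is one common form of speculative decoding and not necessarily the scheme used in the report. The names `verify_draft` and `toy_base_next_token` are hypothetical, and the toy base model simply counts upward modulo 10. A production system would score all draft positions in a single base-model forward pass; the loop below calls the base model once per position only for readability.

```python
from typing import Callable, List, Sequence


def verify_draft(
    context: Sequence[int],
    draft: Sequence[int],
    base_next_token: Callable[[List[int]], int],
) -> List[int]:
    """Greedy verification of a drafted n-gram (illustrative only).

    Keeps the longest prefix of `draft` that matches the base model's own
    greedy choice, then appends one token chosen by the base model, so
    every call emits at least one token.
    """
    emitted: List[int] = []
    for token in draft:
        expected = base_next_token(list(context) + emitted)
        if token != expected:
            # First mismatch: discard the rest of the draft and emit the
            # base model's token in its place.
            emitted.append(expected)
            return emitted
        emitted.append(token)
    # Whole draft accepted: the base model contributes one extra token.
    emitted.append(base_next_token(list(context) + emitted))
    return emitted


def toy_base_next_token(sequence: List[int]) -> int:
    """Hypothetical stand-in for a base model: predicts last token + 1 (mod 10)."""
    return (sequence[-1] + 1) % 10


if __name__ == "__main__":
    # Draft agrees on two tokens, then diverges.
    print(verify_draft([1, 2, 3], [4, 5, 9], toy_base_next_token))  # [4, 5, 6]
    # Draft fully accepted, plus one extra base-model token.
    print(verify_draft([1, 2, 3], [4, 5, 6], toy_base_next_token))  # [4, 5, 6, 7]
```

In this sketch a better draft model means longer accepted prefixes and therefore more tokens emitted per verification step, which is the source of the speedup the abstract describes.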

Cite This Study

Wertheimer et al. (April 29, 2024).

synapsesocial.com/papers/68e6d2d6b6db643587650468
https://doi.org/10.48550/arxiv.2404.19124
