April 24, 2024 · Open Access

SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification

Abstract

This paper introduces SpecInfer, a system that accelerates generative large language model (LLM) serving with tree-based speculative inference and verification. The key idea behind SpecInfer is leveraging small speculative models to predict the LLM's outputs; the predictions are organized as a token tree, whose nodes each represent a candidate token sequence. The correctness of all candidate token sequences represented by a token tree is verified against the LLM in parallel using a novel tree-based parallel decoding mechanism. SpecInfer uses an LLM as a token tree verifier instead of an incremental decoder, which significantly reduces the end-to-end latency and computational requirements for serving generative LLMs while provably preserving model quality.
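The token tree and the verification step described in the abstract can be illustrated with a small sketch. This is not SpecInfer's implementation: the names `TokenNode`, `verify_greedy`, and `toy_llm_next_token` are hypothetical, the "LLM" is a toy lookup over a fixed sentence, and the tree is checked one node at a time under greedy decoding, whereas the paper's mechanism scores all candidate sequences in parallel. The sketch only shows why verification preserves the LLM's output: every accepted token is the one the LLM itself would have chosen.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence


@dataclass
class TokenNode:
    """One node of a token tree; the path from the root to a node is a candidate sequence."""
    token: str
    children: Dict[str, "TokenNode"] = field(default_factory=dict)

    def add_path(self, tokens: Sequence[str]) -> None:
        """Insert a speculated continuation, sharing prefixes with existing paths."""
        node = self
        for tok in tokens:
            node = node.children.setdefault(tok, TokenNode(tok))


def verify_greedy(
    root: TokenNode,
    prefix: Sequence[str],
    llm_next_token: Callable[[List[str]], str],
) -> List[str]:
    """Walk the tree, keeping speculated tokens only while they match the LLM's choice.

    Returns the accepted tokens plus the LLM's own token at the first mismatch,
    so the result equals what plain greedy incremental decoding would emit.
    """
    accepted: List[str] = []
    node = root
    while True:
        target = llm_next_token(list(prefix) + accepted)  # the LLM's greedy choice
        accepted.append(target)
        child = node.children.get(target)
        if child is None:  # no speculated branch matches: stop here
            return accepted
        node = child


# Toy stand-in for the LLM (assumption): always continues one fixed sentence.
REFERENCE = ["the", "cat", "sat", "on", "the", "mat"]


def toy_llm_next_token(context: List[str]) -> str:
    return REFERENCE[len(context)] if len(context) < len(REFERENCE) else "<eos>"


prefix = ["the"]
root = TokenNode(prefix[-1])            # the root holds the last verified token
root.add_path(["cat", "sat", "down"])   # three speculated branches,
root.add_path(["cat", "ran"])           # two of them sharing the prefix "cat"
root.add_path(["dog"])

result = verify_greedy(root, prefix, toy_llm_next_token)
print(result)  # ['cat', 'sat', 'on']
```

Here two speculated tokens ("cat", "sat") are accepted and the LLM supplies "on" at the first mismatch, so three tokens are produced from one verification round instead of one token per decoding step.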

Cite This Study

Miao et al. (2024) studied this question.

synapsesocial.com/papers/68e6dc18b6db643587657e25 https://doi.org/10.1145/3620666.3651335
