April 24, 2024 | Open Access

SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification

Abstract

This paper introduces SpecInfer, a system that accelerates generative large language model (LLM) serving with tree-based speculative inference and verification. The key idea behind SpecInfer is leveraging small speculative models to predict the LLM's outputs; the predictions are organized as a token tree, whose nodes each represent a candidate token sequence. The correctness of all candidate token sequences represented by a token tree is verified against the LLM in parallel using a novel tree-based parallel decoding mechanism. SpecInfer uses an LLM as a token tree verifier instead of an incremental decoder, which significantly reduces the end-to-end latency and computational requirement for serving generative LLMs while provably preserving model quality.
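
The verification idea described in the abstract can be illustrated with a minimal sketch. It is not SpecInfer's implementation: the names `TokenNode`, `verify_greedy`, and `toy_llm` are hypothetical, the sketch assumes greedy decoding only, and it walks the tree one step at a time, whereas the paper verifies all candidate sequences in a single parallel pass. The point it shows is why quality is preserved: a speculated token is kept only if it equals the token the LLM itself would have produced.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class TokenNode:
    """One node of a token tree; the path from a root spells a candidate sequence."""
    token: int
    children: List["TokenNode"] = field(default_factory=list)


def verify_greedy(
    prompt: Sequence[int],
    roots: List[TokenNode],
    llm_next_token: Callable[[Sequence[int]], int],
) -> List[int]:
    """Return the speculated tokens the LLM agrees with, plus one token of its own.

    Sequential stand-in for tree-based parallel decoding (an assumption of
    this sketch): at each step the LLM's greedy choice is compared against
    the candidates offered by the tree at the current position.
    """
    accepted: List[int] = []
    candidates = roots
    while True:
        # Token the LLM would emit for the prompt plus everything accepted so far.
        target = llm_next_token(list(prompt) + accepted)
        accepted.append(target)
        match = next((c for c in candidates if c.token == target), None)
        if match is None:
            # No speculated branch matches: stop after the LLM's own token.
            return accepted
        candidates = match.children


def toy_llm(prefix: Sequence[int]) -> int:
    """Hypothetical stand-in for the LLM: next token is (last token + 1) mod 10."""
    return (prefix[-1] + 1) % 10


# Token tree proposed by (imaginary) small speculative models for prompt [1]:
#   2 -> 3 -> 9
#   2 -> 4
#   5
tree = [
    TokenNode(2, [TokenNode(3, [TokenNode(9)]), TokenNode(4)]),
    TokenNode(5),
]

result = verify_greedy([1], tree, toy_llm)
print(result)  # [2, 3, 4]
```

In this toy run the branch 2, 3 is accepted, the speculated 9 is rejected, and the LLM supplies 4 itself, so three tokens are produced where plain incremental decoding would have produced one per step. The output is identical to what `toy_llm` would generate on its own.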

Cite This Study

Miao et al. (2024) studied this question.

synapsesocial.com/papers/68e6dc18b6db643587657e25 https://doi.org/10.1145/3620666.3651335
