This paper introduces SpecInfer, a system that accelerates generative large language model (LLM) serving with tree-based speculative inference and verification. The key idea behind SpecInfer is leveraging small speculative models to predict the LLM's outputs; the predictions are organized as a token tree, whose nodes each represent a candidate token sequence. The correctness of all candidate token sequences represented by a token tree is verified against the LLM in parallel using a novel tree-based parallel decoding mechanism. SpecInfer uses an LLM as a token tree verifier instead of an incremental decoder, which significantly reduces the end-to-end latency and computational requirement for serving generative LLMs while provably preserving model quality.
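The token-tree idea can be illustrated with a minimal Python sketch. It is not SpecInfer's implementation: the names `TokenNode`, `build_token_tree`, `verify_greedy`, and `toy_llm` are hypothetical, the "LLM" is a toy function, and only greedy decoding is assumed. The sketch also walks the tree one step at a time for readability, whereas the abstract describes checking all candidate sequences in parallel.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence


@dataclass
class TokenNode:
    """One node of a token tree; the path from the root spells a candidate sequence."""
    token: int
    children: Dict[int, "TokenNode"] = field(default_factory=dict)


def build_token_tree(root_token: int, candidates: Sequence[Sequence[int]]) -> TokenNode:
    """Merge speculated continuations into one tree so shared prefixes share nodes."""
    root = TokenNode(root_token)
    for seq in candidates:
        node = root
        for tok in seq:
            node = node.children.setdefault(tok, TokenNode(tok))
    return root


def verify_greedy(
    root: TokenNode,
    prefix: Sequence[int],
    llm_next_token: Callable[[List[int]], int],
) -> List[int]:
    """Return the tokens the LLM itself would emit, using the tree only as a guide.

    Assumption: greedy decoding, and `prefix` ends with the root's token.
    Every returned token is the LLM's own choice, so the output matches plain
    incremental decoding; the tree only determines how far one pass can go.
    """
    accepted: List[int] = []
    context = list(prefix)
    node = root
    while True:
        target = llm_next_token(context)  # the LLM's choice at this position
        accepted.append(target)
        context.append(target)
        child = node.children.get(target)
        if child is None:  # speculation diverged from the LLM: stop here
            return accepted
        node = child


def toy_llm(context: List[int]) -> int:
    """Hypothetical stand-in for the LLM: next token is (last token + 1) mod 10."""
    return (context[-1] + 1) % 10


# Candidate continuations, as small speculative models might propose them.
tree = build_token_tree(2, [[3, 4, 9], [3, 5], [7]])
print(verify_greedy(tree, [1, 2], toy_llm))  # [3, 4, 5]
```

In this example the branch 3, 4 agrees with the toy LLM, so both tokens are accepted, and the LLM's own next token 5 is emitted in place of the mismatched speculation 9. If no branch matches, the function still returns one token, which is the same output as ordinary incremental decoding.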
Miao et al. studied this question.