July 31, 2024 | Open Access

CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving


Abstract

As large language models (LLMs) take on complex tasks, their inputs are supplemented with longer contexts that incorporate domain knowledge. Yet using long contexts is challenging as nothing can be generated until the whole context is processed by the LLM. While the context-processing delay can be reduced by reusing the KV cache of a context across different inputs, fetching the KV cache, which contains large tensors, over the network can cause high extra network delays.
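To make the network-delay concern concrete, the following is a rough back-of-the-envelope sketch of how large an uncompressed KV cache can be and how long it might take to fetch. The model dimensions, context length, and link speed are illustrative assumptions, not figures from the paper, and the function names are hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Approximate size of an uncompressed KV cache.

    The leading factor of 2 counts one key tensor and one value tensor
    per layer; bytes_per_value=2 assumes 16-bit floats.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


def transfer_seconds(size_bytes, bandwidth_gbps):
    """Ideal transfer time over a link of the given speed (gigabits per second)."""
    return size_bytes * 8 / (bandwidth_gbps * 1e9)


# Assumed, illustrative configuration: 32 layers, 32 KV heads of dimension 128,
# an 8,000-token context, and a 10 Gbps link.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, num_tokens=8000)
delay = transfer_seconds(size, bandwidth_gbps=10)

print(f"KV cache size: {size / 1e9:.2f} GB")
print(f"Ideal transfer time at 10 Gbps: {delay:.2f} s")
```

Under these assumptions the cache is a few gigabytes and takes seconds to move even on a fast link, which is the kind of overhead that motivates compressing and streaming the KV cache.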


Cite This Study

Liu et al. (2024) studied this question.

synapsesocial.com/papers/6a08a717ef79633196e8c80b https://doi.org/10.1145/3651890.3672274
