July 31, 2024 | Open Access

CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving

Abstract

As large language models (LLMs) take on complex tasks, their inputs are supplemented with longer contexts that incorporate domain knowledge. Yet long contexts are challenging to use because the LLM cannot generate any output until it has processed the whole context. The context-processing delay can be reduced by reusing the KV cache of a context across different inputs, but fetching the KV cache, which contains large tensors, over the network can add substantial network delay.
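To give a rough sense of why fetching a KV cache over the network can be slow, the sketch below estimates the size of an uncompressed KV cache and the time needed to transfer it. It is a back-of-the-envelope illustration, not the paper's method: the function names, the model dimensions (32 layers, 32 KV heads, head dimension 128), the 8,000-token context, the 2-byte element size, and the 10 Gbps link are all assumed values chosen for the example.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   num_tokens: int, bytes_per_element: int) -> int:
    """Size of an uncompressed KV cache in bytes.

    Each layer stores one key tensor and one value tensor (hence the factor 2),
    each holding num_kv_heads * head_dim elements per token.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_element


def transfer_seconds(size_bytes: int, bandwidth_gbps: float) -> float:
    """Idealized time to send size_bytes over a link of bandwidth_gbps.

    Ignores protocol overhead, latency, and congestion.
    """
    bits = size_bytes * 8
    return bits / (bandwidth_gbps * 1e9)


# Hypothetical configuration, assumed for illustration only.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                      num_tokens=8000, bytes_per_element=2)
delay = transfer_seconds(size, bandwidth_gbps=10.0)

print(f"KV cache size: {size / 1e9:.2f} GB")
print(f"Transfer time at 10 Gbps: {delay:.2f} s")
```

Under these assumed numbers the cache is several gigabytes and takes seconds to transfer even on a fast link, which is the delay that compressing and streaming the KV cache aims to reduce.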

Authors

Yuhan Liu

Beijing Institute of Technology

Hanchen Li

University of Massachusetts Chan Medical School

Yihua Cheng

University of Chicago

Institutions

Stanford University

University of Chicago

Microsoft (United States)
