As large language models (LLMs) take on complex tasks, their inputs are supplemented with longer contexts that incorporate domain knowledge. Yet using long contexts is challenging, as nothing can be generated until the LLM has processed the whole context. While the context-processing delay can be reduced by reusing the KV cache of a context across different inputs, fetching the KV cache, which contains large tensors, over the network can add substantial delay of its own.
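To give a rough sense of why fetching a KV cache over the network can be slow, the following back-of-the-envelope sketch estimates the cache size and transfer time for one context. The model shape (32 layers, 32 KV heads, head dimension 128, 16-bit values), the 8,000-token context, and the 10 Gbps link are illustrative assumptions, not figures from the paper, and the helper names `kv_cache_bytes` and `transfer_seconds` are hypothetical.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Approximate size of an uncompressed KV cache for one context.

    Each layer stores two tensors (keys and values), each of shape
    num_tokens x num_kv_heads x head_dim.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


def transfer_seconds(size_bytes, bandwidth_gbps):
    """Time to move size_bytes over a link of bandwidth_gbps gigabits per second."""
    return size_bytes * 8 / (bandwidth_gbps * 1e9)


# Assumed (hypothetical) configuration: an 8,000-token context on a model with
# 32 layers, 32 KV heads, head dimension 128, and 16-bit values, on a 10 Gbps link.
size = kv_cache_bytes(num_tokens=8000, num_layers=32, num_kv_heads=32, head_dim=128)
delay = transfer_seconds(size, bandwidth_gbps=10)
print(f"{size / 1e9:.2f} GB, {delay:.2f} s")  # prints "4.19 GB, 3.36 s"
```

Under these assumptions, the cache for a single context is several gigabytes, so the transfer alone takes seconds even on a fast link. This estimate ignores protocol overhead and any compression.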
Yuhan Liu
Beijing Institute of Technology
Hanchen Li
University of Massachusetts Chan Medical School
Yihua Cheng
University of Chicago
Stanford University
University of Chicago
Microsoft (United States)
Liu et al. studied this question.
synapsesocial.com/papers/6a08a717ef79633196e8c80b — DOI: https://doi.org/10.1145/3651890.3672274