The deployment of large language models (LLMs) is often hindered by the extensive memory requirements of the Key-Value (KV) cache, especially as context lengths increase. Existing approaches to reducing the KV cache size either fine-tune the model to learn a compression strategy or use attention scores to shorten the cached sequence. We analyse the attention distributions in decoder-only Transformer-based models and observe that attention allocation patterns stay consistent across most layers. Surprisingly, we find a clear correlation between the L₂ norm of key embeddings and the attention scores over cached KV pairs: a key embedding with a low L₂ norm usually receives a high attention score during decoding. This finding indicates that the influence of a KV pair is potentially determined by the key embedding itself, before it is queried. Based on this observation, we compress the KV cache according to the L₂ norm of the key embeddings, keeping the pairs whose keys have the lowest norm. Our experimental results show that this simple strategy can reduce the KV cache size by 50% on language modelling and needle-in-a-haystack tasks, and by 90% on passkey retrieval tasks, without losing accuracy.
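The selection rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `l2_norm` and `compress_kv_cache`, the `keep_ratio` parameter, and the use of plain Python lists for a single attention head are all assumptions made for illustration. A real system would most likely apply the same rule to tensors, separately for each layer and head.

```python
import math


def l2_norm(vec):
    """Euclidean (L2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def compress_kv_cache(keys, values, keep_ratio=0.5):
    """Keep the KV pairs whose key embeddings have the lowest L2 norm.

    Hypothetical helper for one attention head.
    keys, values: lists of equal length, one vector per cached token.
    keep_ratio: fraction of cached pairs to retain (assumed parameter).
    Returns (kept_keys, kept_values, kept_indices), with indices in the
    original token order.
    """
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")
    if not keys:
        return [], [], []

    n_keep = max(1, math.ceil(len(keys) * keep_ratio))
    norms = [l2_norm(k) for k in keys]

    # Rank positions by ascending key norm and take the n_keep smallest.
    lowest = sorted(range(len(keys)), key=lambda i: norms[i])[:n_keep]
    # Restore the original order so the cache stays position-consistent.
    kept = sorted(lowest)

    return [keys[i] for i in kept], [values[i] for i in kept], kept


# Usage: four cached tokens, keep half of them.
keys = [[3.0, 4.0], [0.1, 0.2], [1.0, 0.0], [5.0, 5.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
kept_keys, kept_values, kept_idx = compress_kv_cache(keys, values, keep_ratio=0.5)
```

The selection uses only the stored keys, so no attention scores or queries are needed. That is the property the abstract points to when it says a pair's influence may be determined before the key is queried.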
Devoto et al. studied this question.
www.synapsesocial.com/papers/68e64779b6db6435875d908a — DOI: https://doi.org/10.48550/arxiv.2406.11430
Alessio Devoto
Yu Zhao
Simone Scardapane