June 17, 2024 | Open Access

A Simple and Effective L₂ Norm-Based Strategy for KV Cache Compression

Abstract

The deployment of large language models (LLMs) is often hindered by the extensive memory requirements of the Key-Value (KV) cache, especially as context lengths increase. Existing approaches to reducing the KV cache size either fine-tune the model to learn a compression strategy or use attention scores to reduce the sequence length. We analyse the attention distributions in decoder-only Transformer-based models and observe that attention allocation patterns stay consistent across most layers. Surprisingly, we find a clear correlation between the L₂ norm of a key embedding and the attention scores over cached KV pairs: a low L₂ norm usually leads to a high attention score during decoding. This finding indicates that the influence of a KV pair is potentially determined by the key embedding itself, before it is queried. Based on this observation, we compress the KV cache according to the L₂ norm of the key embeddings. Our experimental results show that this simple strategy can reduce the KV cache size by 50% on language modelling and needle-in-a-haystack tasks, and by 90% on passkey retrieval tasks, without losing accuracy.
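The selection rule described in the abstract can be illustrated with a minimal sketch. It assumes that "compressing based on the L₂ norm" means retaining the cached pairs whose keys have the lowest norm, since those are the ones reported to attract high attention. The function name `compress_kv_cache`, the `keep_ratio` parameter, and the single-head list-of-vectors layout are hypothetical simplifications for illustration, not the authors' implementation.

```python
import math


def l2_norm(vec):
    """Euclidean (L2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def compress_kv_cache(keys, values, keep_ratio=0.5):
    """Keep the KV pairs whose keys have the lowest L2 norm.

    Assumptions (not taken from the paper): one attention head, one
    vector per cached token, and a fixed fraction `keep_ratio` to retain.
    Returns (kept_keys, kept_values, kept_positions) in original order.
    """
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    # Retain at least one pair whenever the cache is non-empty.
    n_keep = max(1, int(len(keys) * keep_ratio)) if keys else 0

    # Rank token positions by key norm, lowest first.
    ranked = sorted(range(len(keys)), key=lambda i: l2_norm(keys[i]))

    # Restore the original token order among the survivors.
    kept = sorted(ranked[:n_keep])
    return [keys[i] for i in kept], [values[i] for i in kept], kept


# Toy cache with four tokens; the key norms are 5.0, about 0.22, 1.0 and 10.0.
keys = [[3.0, 4.0], [0.1, 0.2], [1.0, 0.0], [6.0, 8.0]]
values = [[1.0], [2.0], [3.0], [4.0]]

kept_keys, kept_values, kept_positions = compress_kv_cache(keys, values, keep_ratio=0.5)
print(kept_positions)  # [1, 2]
```

Because the ranking depends only on the keys, no attention scores need to be computed or stored to decide what to evict. A real model would apply a rule like this per layer and per head on tensors rather than Python lists.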

Cite This Study

Devoto et al. (2024) studied this question.

synapsesocial.com/papers/68e64779b6db6435875d908a https://doi.org/10.48550/arxiv.2406.11430
