What question did this study set out to answer?

This study asks which KV cache optimization strategy, custom quantization or PagedAttention, gives better memory use and throughput than an FP16 baseline in autoregressive large language model inference.

June 7, 2026 · Open Access

Scaling Long-Context LLMs via Unified KV Cache Optimization: A Comparative Study of Paged Attention and Quantization

Key points

Systematic empirical study of three KV cache optimization strategies: FP16 baseline, custom quantization, and PagedAttention.
Evaluation on a single NVIDIA A100 GPU across input lengths of 512–4096 tokens and output lengths of 32–128 tokens, measuring peak GPU memory, decode throughput (tokens/sec), time-to-first-token (TTFT), prefill latency, and total generation time.
Comparative analysis of throughput and memory efficiency for each strategy under realistic conditions.
Naive KV quantization resulted in less than 2% memory reduction but caused a 55–89% drop in throughput due to dequantization overhead.
PagedAttention improved throughput by 2.0–2.5× compared to FP16 baseline while increasing peak memory reservation by 2.2×.
Overall, memory-layout-based optimizations (PagedAttention) surpassed tensor-compression methods (quantization) in effective performance.
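The dequantization overhead mentioned in the key points can be illustrated with a toy cache. This is a minimal sketch, not the study's QuantizedKVCache class: `quantize_int8`, `dequantize_int8`, and `ToyQuantizedCache` are hypothetical names, and symmetric per-vector INT8 scaling is an assumed scheme. It shows that if every stored key is dequantized again on each decode step, the total dequantization work grows with the square of the number of generated tokens.

```python
def quantize_int8(vec):
    """Symmetric per-vector INT8 quantization (assumed scheme): returns (codes, scale)."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in vec]
    return codes, scale


def dequantize_int8(codes, scale):
    """Map INT8 codes back to approximate float values."""
    return [c * scale for c in codes]


class ToyQuantizedCache:
    """Hypothetical cache that stores keys as INT8 and dequantizes on every read."""

    def __init__(self):
        self._keys = []          # list of (codes, scale), one entry per cached token
        self.dequant_calls = 0   # counts per-token dequantizations

    def append(self, key_vec):
        self._keys.append(quantize_int8(key_vec))

    def keys_for_attention(self):
        # Attention needs float keys, so the whole cache is dequantized each step.
        out = []
        for codes, scale in self._keys:
            out.append(dequantize_int8(codes, scale))
            self.dequant_calls += 1
        return out


cache = ToyQuantizedCache()
for step in range(4):
    cache.append([0.5 * (step + 1), -1.0, 0.25])  # made-up key vector for this token
    cache.keys_for_attention()                    # one attention read per decode step

print(cache.dequant_calls)  # 1 + 2 + 3 + 4 = 10
```

Fused kernels that read the quantized values directly inside the attention computation avoid this repeated round trip; the sketch only models the naive case.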

Abstract

The Key-Value (KV) cache is the dominant memory bottleneck in autoregressive large language model (LLM) inference, growing linearly with context length. This paper presents a systematic empirical study of three KV cache optimization strategies applied to Mistral-7B-Instruct-v0.2 on a single NVIDIA A100 GPU: (1) a full-precision FP16 baseline using Hugging Face Transformers, (2) custom INT8/INT4/INT3 quantization via a DynamicCache-compatible QuantizedKVCache class, and (3) vLLM’s PagedAttention. Across input lengths of 512–4096 tokens and output lengths of 32–128 tokens, we measure peak GPU memory, decode throughput (tokens/sec), time-to-first-token (TTFT), prefill latency, and total generation time. Our results show that naive KV quantization reduces memory by less than 2% while incurring 55–89% throughput degradation due to dequantization overhead on every attention step. In contrast, vLLM’s PagedAttention delivers a 2.0–2.5× throughput improvement over the FP16 baseline, at the cost of 2.2× higher peak memory reservation. These findings reveal a clear architectural hierarchy: memory-layout-based optimizations (PagedAttention) outperform tensor-compression-based approaches (quantization) under realistic single-batch inference conditions.
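The linear growth of the KV cache with context length can be made concrete with a short calculation. The sketch below assumes the publicly documented Mistral-7B shape (32 layers, 8 key-value heads under grouped-query attention, head dimension 128); these values are not stated in the abstract. `KVConfig` and `kv_cache_bytes` are hypothetical helper names, and the estimate ignores quantization scales and allocator overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVConfig:
    # Assumed Mistral-7B shape; not taken from the abstract.
    num_layers: int = 32
    num_kv_heads: int = 8
    head_dim: int = 128


def kv_cache_bytes(cfg, seq_len, bits_per_value=16, batch_size=1):
    """Estimate KV cache size: one key and one value vector per layer, head, and token."""
    values = 2 * cfg.num_layers * cfg.num_kv_heads * cfg.head_dim * seq_len * batch_size
    return values * bits_per_value // 8


cfg = KVConfig()
for seq_len in (512, 1024, 2048, 4096):
    fp16_mib = kv_cache_bytes(cfg, seq_len, 16) / 2**20
    int8_mib = kv_cache_bytes(cfg, seq_len, 8) / 2**20
    int4_mib = kv_cache_bytes(cfg, seq_len, 4) / 2**20
    print(f"{seq_len:5d} tokens: FP16 {fpp16_mib if False else fp16_mib:7.1f} MiB, "
          f"INT8 {int8_mib:7.1f} MiB, INT4 {int4_mib:7.1f} MiB")
```

Under these assumptions the FP16 cache costs 128 KiB per token, or 512 MiB at 4096 tokens, which is small next to the roughly 14 GB that the FP16 weights of a 7-billion-parameter model occupy. That ratio may help explain why compressing the cache alone moves peak memory so little at these lengths, although the abstract does not give this explanation itself.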
