The Key-Value (KV) cache is the dominant memory bottleneck in autoregressive large language model (LLM) inference, growing linearly with context length. This paper presents a systematic empirical study of three KV cache optimization strategies applied to Mistral-7B-Instruct-v0.2 on a single NVIDIA A100 GPU: (1) a full-precision FP16 baseline using Hugging Face Transformers, (2) custom INT8/INT4/INT3 quantization via a DynamicCache-compatible QuantizedKVCache class, and (3) vLLM’s PagedAttention. Across input lengths of 512–4096 tokens and output lengths of 32–128 tokens, we measure peak GPU memory, decode throughput (tokens/sec), time-to-first-token (TTFT), prefill latency, and total generation time. Our results show that naive KV quantization reduces peak memory by less than 2% while incurring 55–89% throughput degradation due to dequantization overhead on every attention step. In contrast, vLLM’s PagedAttention delivers a 2.0–2.5× throughput improvement over the FP16 baseline, at the cost of 2.2× higher peak memory reservation. These findings reveal a clear architectural hierarchy: memory-layout-based optimizations (PagedAttention) outperform tensor-compression-based approaches (quantization) under realistic single-batch inference conditions.
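The linear growth of the KV cache with context length can be illustrated with a back-of-the-envelope estimate. The sketch below is not the paper's measurement code: the function name `kv_cache_bytes` is hypothetical, and the architecture values (32 layers, 8 KV heads under grouped-query attention, head dimension 128) are assumed from the published Mistral-7B configuration rather than stated in the abstract. It also ignores quantization scale and zero-point metadata and any allocator overhead.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bits_per_elem=16, batch=1):
    """Estimate raw KV cache size in bytes (assumed Mistral-7B shape).

    Each layer stores one K and one V tensor, each of shape
    (batch, n_kv_heads, seq_len, head_dim). Quantization metadata
    (scales, zero points) is deliberately ignored here.
    """
    total_bits = (2 * n_layers * n_kv_heads * head_dim
                  * bits_per_elem * seq_len * batch)
    return total_bits // 8


MIB = 1024 ** 2

# Context lengths spanning the input range studied in the abstract.
for seq_len in (512, 1024, 2048, 4096):
    sizes = {
        name: kv_cache_bytes(seq_len, bits_per_elem=bits) / MIB
        for name, bits in (("FP16", 16), ("INT8", 8), ("INT4", 4), ("INT3", 3))
    }
    row = ", ".join(f"{name}={mib:.1f} MiB" for name, mib in sizes.items())
    print(f"{seq_len:>5} tokens: {row}")
```

Under these assumptions the FP16 cache costs 128 KiB per token, or 512 MiB at 4096 tokens. That is small next to the roughly 14 GB of FP16 weights a 7B-parameter model occupies, which may help explain why compressing only the cache changes peak memory so little at batch size 1.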
Kalpit Patel studied this question.