The Problem Large language models (LLMs) store key-value (KV) vectors in memory during text generation to avoid redundant computation. At long context lengths, this KV cache becomes the dominant memory bottleneck — a 7B-parameter model processing 4096 tokens requires over 2 GB of KV cache alone. INT4 quantization (storing each value in 4 bits instead of 16) is a standard solution, reducing memory by 4x. However, we find that naive INT4 quantization fails catastrophically on certain models, increasing perplexity by +8293 on Qwen2-7B at 4096 tokens — effectively destroying the model's output. The Fix We propose norm-separated quantization (nsep+pchan) — a simple preprocessing step that decomposes each KV vector into its magnitude (L2 norm, stored exactly) and direction (quantized to INT4 with per-channel scaling). This addresses two independent failure modes: (1) token-wise norm variation that inflates the quantization dynamic range, and (2) activation outlier channels that corrupt the quantization scale. The method is 4 lines of code, requires no training or calibration, and adds negligible computational overhead (~4 MB for 1024 tokens, <1% of KV cache). Results 44,000x improvement on the worst case (Qwen2-7B at 4096 tokens: ΔPPL +8293 → +0.19) 1885x improvement on WikiText-2 (Qwen2-7B: ΔPPL +812 → +0.43) Never degrades models where naive INT4 already works (worst case: +0.24 ΔPPL) Validated on 12 Pre-LN models from 124M to 40B parameters (GPT-2, Pythia, OPT, Qwen, Mistral, Falcon) WikiText-2 benchmark on 7 models (124M to 14B) Long-context stability verified up to 4096 tokens Practical Impact For inference providers and on-device deployment: Drop-in replacement for naive INT4 KV cache quantization Eliminates unpredictable per-model quantization failures Enables reliable 4x KV cache memory reduction across all Pre-LN architectures Compatible with existing methods (KIVI, GEAR, SmoothQuant) as a preprocessing step Repository Contents paper/ — Full paper (15 pages, 5 figures, 25 references) with LaTeX source experiments/ — All experiment scripts (reproducible on M1 Mac + Google Colab) results/ — Complete experimental results in JSON format (19 experiments) docs/ — Detailed experiment report and original experiment plan Keywords KV cache compression, quantization, large language models, INT4, activation outliers, norm separation, Pre-LN Transformer, inference optimization License Apache License 2.0 DOI 10.5281/zenodo.19602981 Related Identifiers GitHub repository: https://github.com/metaSATOKEN/norm-separated-quantization Based on: "The Arc and Its Thickness: Geometric Decomposition of Pre-LayerNorm Transformer Hidden States" (Sato, 2026)
Kentaro Sato (Wed,) studied this question.