The Problem

Large language models (LLMs) store key-value (KV) vectors in memory during text generation to avoid redundant computation. At long context lengths, this KV cache becomes the dominant memory bottleneck: a 7B-parameter model processing 4096 tokens requires over 2 GB of KV cache alone. INT4 quantization (storing each value in 4 bits instead of 16) is a standard solution, reducing memory by 4x. However, we find that naive INT4 quantization fails catastrophically on certain models: on Qwen2-7B at 4096 tokens it raises perplexity by 8293 (ΔPPL +8293), effectively destroying the model's output.

The Fix

We propose norm-separated quantization (nsep+pchan), a simple preprocessing step that decomposes each KV vector into its magnitude (L2 norm, stored exactly) and its direction (quantized to INT4 with per-channel scaling). This addresses two independent failure modes: (1) token-wise norm variation that inflates the quantization dynamic range, and (2) activation outlier channels that corrupt the quantization scale. The method is 4 lines of code, requires no training or calibration, and adds negligible overhead (~4 MB for 1024 tokens, <1% of KV cache).
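The decomposition described above can be sketched in NumPy. This is an illustrative sketch, not the repository's implementation: the function names, the symmetric mapping of each channel's largest magnitude to 7, the choice of one scale per channel shared across tokens, and the use of int8 storage (instead of packing two 4-bit values per byte) are all assumptions made for the example.

```python
import numpy as np


def quantize_int4_per_channel(x):
    """Symmetric 4-bit quantization with one scale per channel (column).

    x has shape (tokens, channels). Values are stored in int8 for
    simplicity; a real cache would pack two 4-bit values per byte.
    """
    # Assumption: map each channel's largest magnitude to the level 7.
    scale = np.abs(x).max(axis=0, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # all-zero channel: avoid 0/0
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale


def nsep_pchan_quantize(kv):
    """Split each row into an exact L2 norm and a quantized direction."""
    norms = np.linalg.norm(kv, axis=-1, keepdims=True)  # kept unquantized
    safe = np.where(norms == 0, 1.0, norms)             # zero rows stay zero
    directions = kv / safe                              # unit-norm rows
    q, scale = quantize_int4_per_channel(directions)
    return q, scale, norms


def nsep_pchan_dequantize(q, scale, norms):
    """Rebuild an approximation of kv from codes, scales, and norms."""
    return q.astype(np.float32) * scale * norms


if __name__ == "__main__":
    # Synthetic stand-in for a KV slice: token norms vary widely and one
    # channel is an outlier. These numbers are made up for illustration.
    rng = np.random.default_rng(0)
    kv = rng.standard_normal((64, 16)) * rng.lognormal(0.0, 1.0, size=(64, 1))
    kv[:, 3] *= 50.0

    q, scale, norms = nsep_pchan_quantize(kv)
    recon = nsep_pchan_dequantize(q, scale, norms)
    print("max abs reconstruction error:", np.abs(recon - kv).max())
```

Because each direction entry is rounded to the nearest level within its channel's range, the per-element reconstruction error in this sketch is bounded by half the channel scale times the token's norm, so a token with a small norm also gets a proportionally small absolute error.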
Results

- 44,000x improvement on the worst case (Qwen2-7B at 4096 tokens: ΔPPL +8293 → +0.19)
- 1885x improvement on WikiText-2 (Qwen2-7B: ΔPPL +812 → +0.43)
- Never degrades models where naive INT4 already works (worst case: +0.24 ΔPPL)
- Validated on 12 Pre-LN models from 124M to 40B parameters (GPT-2, Pythia, OPT, Qwen, Mistral, Falcon)
- WikiText-2 benchmark on 7 models (124M to 14B)
- Long-context stability verified up to 4096 tokens

Practical Impact

For inference providers and on-device deployment:

- Drop-in replacement for naive INT4 KV cache quantization
- Eliminates unpredictable per-model quantization failures
- Enables reliable 4x KV cache memory reduction across all Pre-LN architectures
- Compatible with existing methods (KIVI, GEAR, SmoothQuant) as a preprocessing step

Repository Contents

- paper/: Full paper (15 pages, 5 figures, 25 references) with LaTeX source
- experiments/: All experiment scripts (reproducible on M1 Mac + Google Colab)
- results/: Complete experimental results in JSON format (19 experiments)
- docs/: Detailed experiment report and original experiment plan

Keywords

KV cache compression, quantization, large language models, INT4, activation outliers, norm separation, Pre-LN Transformer, inference optimization

License

Apache License 2.0

DOI

10.5281/zenodo.19602981

Related Identifiers

- GitHub repository: https://github.com/metaSATOKEN/norm-separated-quantization
- Based on: "The Arc and Its Thickness: Geometric Decomposition of Pre-LayerNorm Transformer Hidden States" (Sato, 2026)
Kentaro Sato
Kentaro Sato (Wed,) studied this question.
www.synapsesocial.com/papers/69e3203440886becb653f56d — DOI: https://doi.org/10.5281/zenodo.19602981