What question did this study set out to answer?

Can a minimalist KV cache quantization recipe that needs no training and no calibration maintain low perplexity on Transformers?

May 24, 2026 · Open Access

Norm Separation is Necessary: A Minimalist Calibration-Free INT3 Recipe for KV Cache Quantization

Key Points

The study investigates whether a minimalist KV cache quantization recipe, with no training and no calibration, can maintain low perplexity on Transformers.
The recipe combines norm separation, per-channel quantization, and an asymmetric [min, max] window, with no calibration or additional training.
It is tested on eight open-weight Transformers ranging from GPT-J-6B to Gemma-4-E4B.
A four-method ablation confirms that each of the three ingredients is necessary.
The recipe reduces mean WikiText-2 ΔPPL by 66% versus symmetric per-channel quantization at 3.0 bits/value.
Two bounded downstream probes (MMLU 5-shot and Needle-in-a-Haystack) show no statistically significant difference between the proposed recipe and FP16.
Transplanting the asymmetric per-channel recipe from weight quantization to the KV cache without norm separation fails on five of eight models, which confirms that norm separation is needed to preserve model performance.
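
The three ingredients named above (norm separation, per-channel quantization, asymmetric [min, max] window) can be illustrated with a minimal sketch. This is not the author's reference implementation: the function names, the choice to compute the window over the unit-norm directions, and the use of plain Python lists instead of PyTorch tensors are all assumptions made to keep the example self-contained.

```python
import math

LEVELS = 2 ** 3 - 1  # INT3 gives 8 levels, i.e. codes 0..7


def quantize_int3(x):
    """Quantize a (tokens x channels) list of rows. Hypothetical helper."""
    # 1. Norm separation (assumed form): keep each token's L2 norm aside
    #    and quantize only the unit-norm direction.
    norms = [math.sqrt(sum(v * v for v in row)) or 1.0 for row in x]
    unit = [[v / n for v in row] for row, n in zip(x, norms)]
    # 2. Per-channel asymmetric window: one (min, max) pair per channel,
    #    taken across all tokens.
    cols = list(zip(*unit))
    lo = [min(c) for c in cols]
    scale = [(max(c) - l) / LEVELS or 1.0 for c, l in zip(cols, lo)]
    # 3. Round each value to the nearest of the 8 levels in its window.
    codes = [
        [min(LEVELS, max(0, round((v - l) / s))) for v, l, s in zip(row, lo, scale)]
        for row in unit
    ]
    return codes, norms, lo, scale


def dequantize_int3(codes, norms, lo, scale):
    """Rebuild approximate values: direction from codes, magnitude from norms."""
    return [
        [n * (l + c * s) for c, l, s in zip(row, lo, scale)]
        for row, n in zip(codes, norms)
    ]
```

Under these assumptions the stored metadata is one norm per token plus one (min, scale) pair per channel, and the per-element reconstruction error is bounded by the token norm times half the channel's step size. A production path would bit-pack the codes and operate on tensors.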

Abstract

KV cache quantization is one of the principal levers for deploying long-context Transformer inference on fixed memory. We ask: at the INT3 tier (3 bits/value, ~5.09× measured compression including FP16 metadata overhead), is there a minimalist recipe (no calibration, no training, no codebook, no rotation, no adapter, roughly ten lines of Python) that keeps WikiText-2 ΔPPL below 5% of baseline on a broad set of open-weight Transformers? We show that the composition of norm separation, per-channel quantization, and an asymmetric [min, max] window achieves this on eight open-weight Transformers spanning GPT-J-6B (2021) through Qwen3.5-9B and Gemma-4-E4B (2026), reducing mean WikiText-2 ΔPPL by 66% versus a symmetric per-channel variant at identical 3.0 bits/value. A four-method ablation establishes that each of the three ingredients is empirically necessary.

Most informatively, the naive transplant of the asymmetric per-channel recipe from weight quantization (GPTQ, AWQ) to the KV cache, without norm separation, fails on five of eight models, catastrophically on three, with regressions up to 24× versus a naive per-row INT3 baseline on Llama-3.1-8B. A direct measurement of per-token L2 norm dispersion localizes this failure to the V cache: the Layer-0 V-CV correlates with the failure magnitude at Spearman ρ ≈ 0.88, while the analogous K-CV does not (ρ ≈ 0.29). We argue that norm separation is therefore a structural ingredient for per-channel KV quantization specifically because V-cache reconstruction error bypasses the softmax normalization that partially absorbs K-side error.

Two bounded downstream probes, MMLU 5-shot on two models and a 3-model multi-needle Needle-in-a-Haystack (NIH) probe at 8K and 32K context (pooled n = 30), show no statistically significant difference between the proposed recipe and FP16.
Under the symmetric-only INT3 variant at 32K, the NIH pass rate drops to 50% and the model produces bit-mixing hallucinations (cross-needle character fusion such as "3X-RED-1984", where RED is not in any needle and 1984 comes from a different needle) that are absent under FP16 and under the proposed recipe.

Two additional empirical verifications close common implementation concerns:

Real bit-packed cache end-to-end PPL (Phase 26): the WikiText-2 ΔPPL numbers measured under the simulator are reproduced by the actual bit-packed Int3AsymNsepPchanCache to within 1e-3 PPL on Qwen2-7B (8.4727 vs 8.4730) and Llama-3.1-8B (7.4900 vs 7.4898), confirming that the reported quality numbers are not simulator artifacts.

Per-token decode latency (Phase 27): the decode-side overhead of the v0.3 recipe is +2.4% on Qwen2-7B and +4.2% on Llama-3.1-8B under torch.compile(dynamic=False) on a single A100 (1024-token prefix, 50 decode steps, median of 3 repeats). No hand-written CUDA or Triton kernel is required; the full low-overhead path is inside the standard PyTorch toolchain. Earlier "+21%" estimates from a naive per-head Python loop are dominated by CUDA kernel-launch latency, which is removed by per-layer batching plus torch.compile fusion.

We also document a separate, secondary phenomenon: the Qwen2-7B "INT4 valley", a non-monotonic INT4 ΔPPL spike (+1382 versus +12 at INT3 and +5 at INT5) under naive per-row symmetric quantization, and three further non-monotone Qwen-family checkpoints (Qwen2.5-1.5B / 3B / 7B) characterized at the same bit-width granularity. The proposed v0.3 recipe handles all four valley cases under the same fixed bit budget without bifurcating on mode classification.

Scope limitations (explicit, also discussed in §6.3 of the paper):
Single primary quality benchmark (WikiText-2 sliding-window perplexity, ctx = 1024, 10 windows).
MMLU and NIH are bounded exploratory probes, not full downstream-task campaigns.
Sample size for the Mode 1 anomaly (Qwen2-7B INT4 valley) is n = 1 in the present 8-model benchmark; we structure claims accordingly.
Decode-latency measurement is at batch size 1 on a single NVIDIA A100. Throughput at larger batch sizes, longer prefixes, or under continuous batching / speculative decoding is not evaluated.
Reasoning benchmarks (GSM-8K, MATH), code generation (HumanEval), instruction-following, and the multi-task LongBench / RULER suites are not touched.
Post-LN architectures, state-space models, and Transformer variants with very non-standard KV cache structure were not evaluated.

Reproducibility. All scripts, raw JSON results, and figures are released under Apache 2.0 at https://github.com/metaSATOKEN/modeq. Every ΔPPL number, compression ratio, and latency number reported in the paper is reproducible from the committed JSON data via a matching script under experiments/. The reference implementation fits in under 50 lines of Python; no custom CUDA kernel is used.

Companion work. This paper builds on, but does not depend on, an earlier companion mechanism study at the INT4 tier (Norm-Separated Quantization, Zenodo 10.5281/zenodo.19602981), which characterizes the same norm-separation idea in the INT4 regime with a different model panel. The geometric foundation for the role of the L2 norm in Pre-LN hidden states is treated separately in The Arc and Its Thickness (Zenodo 10.5281/zenodo.19590036).

Acknowledgments. This work was conducted with substantial assistance from large language models (Claude, Anthropic) for experiment design, code generation, debugging, data analysis, and manuscript drafting. All experimental results were produced by executing code on real hardware (NVIDIA A100 via Google Colab) and verified by the author. The author takes full responsibility for the scientific claims and any errors in this work.


Cite This Study

Kentaro Sato studied this question.

synapsesocial.com/papers/6a12966a48a0ea16656733f4 https://doi.org/10.5281/zenodo.19724817
