KV-cache compression is increasingly constrained not only by storage budget but by how quantization error perturbs downstream attention logits. Existing preprocessing-based methods typically apply orthogonal transforms before quantization, motivated by improved coordinate marginals. We show that this intuition breaks down for fully adaptive block vector quantizers: once the codebook is learned in the transformed space, orthogonal preprocessing alone does not improve the optimal distortion. We therefore study KV quantization through the downstream logit objective and derive a query-aware metric from the calibration query covariance. While the induced Mahalanobis geometry is optimal in the high-rate limit, it becomes unstable at low bitrate because extreme eigenvalue spreads over-concentrate codebook capacity along a few dominant directions. We address this failure mode with Spectral Metric-Aware Quantization (SMAQ), which applies log-compressed spectral shaping to the query covariance before quantization, preserving task-relevant anisotropy while regularizing the effective condition number. On offline TinyLlama-1.1B KV traces with 8D blocks and 256 centroids, SMAQ reduces held-out logit MSE by 5.2%–14.1% across tested layers, with an 8.3% mean reduction over standard vector quantization, while orthogonal baselines yield no measurable gain under the same adaptive codec. We further validate SMAQ end-to-end on production-scale models via an MLX autoregressive caching framework: on Qwen3.5-9B with offline-calibrated Σ_q, folding the SMAQ metric into TurboQuant’s orthogonal codec achieves 10.81 perplexity (vs. 10.48 uncompressed) at 4.83× compression, closing the perplexity gap by 97% relative to prompt-time dynamic calibration. Cross-architecture validation on Gemma-4-E4B confirms stability across sliding-window and KV-shared attention topologies.
Additional appendix experiments on synthetic condition-ratio sweeps, an expanded TinyLlama layer study, and Qwen2.5-0.5B support the same qualitative conclusion. We position SMAQ as a calibrated, workload-aware compression method particularly well suited to vertical deployments with stable query distributions.
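To make the query-aware metric concrete, the sketch below illustrates two ideas from the abstract on synthetic data. First, for a key quantization error e, the expected squared logit perturbation over calibration queries is E[(qᵀe)²] = eᵀ Σ_q e, which is why the query covariance induces a Mahalanobis geometry. Second, compressing the spectrum of Σ_q reduces its condition number while keeping the eigenvectors and the eigenvalue ordering. The abstract does not give the exact shaping function, so the `log1p`-based rule, the trace renormalization, and every function name here are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def query_covariance(Q):
    """Uncentered second-moment matrix of calibration queries, shape (d, d)."""
    return Q.T @ Q / Q.shape[0]

def expected_logit_mse(err, sigma_q):
    """Mean over error vectors e of e^T Sigma_q e, i.e. E_q[(q^T e)^2]."""
    return float(np.mean(np.einsum("nd,de,ne->n", err, sigma_q, err)))

def log_shaped_metric(sigma_q, eps=1e-8):
    """Hypothetical log-compressed spectral shaping (assumed form).

    Each eigenvalue lam_i is replaced by log1p(lam_i / lam_min), then the
    spectrum is rescaled so the trace of the metric is unchanged.
    Returns the shaped metric, the raw spectrum, and the shaped spectrum.
    """
    lam, U = np.linalg.eigh(sigma_q)          # ascending eigenvalues
    lam = np.clip(lam, eps, None)             # guard against tiny or negative values
    shaped = np.log1p(lam / lam.min())        # monotone, so ordering is preserved
    shaped *= lam.sum() / shaped.sum()        # keep the overall scale (trace)
    metric = (U * shaped) @ U.T               # U diag(shaped) U^T
    return metric, lam, shaped

# Synthetic calibration queries for one 8D block with a wide eigenvalue spread.
rng = np.random.default_rng(0)
d = 8
Q = rng.standard_normal((4096, d)) * np.logspace(0, -3, d)
sigma_q = query_covariance(Q)

# Stand-in quantization errors (k - k_hat) for a handful of keys.
err = 0.01 * rng.standard_normal((16, d))

# Closed form versus a direct average of squared logit perturbations.
closed_form = expected_logit_mse(err, sigma_q)
empirical = float(np.mean((Q @ err.T) ** 2))

metric, lam, shaped = log_shaped_metric(sigma_q)
kappa_raw = lam.max() / lam.min()
kappa_shaped = shaped.max() / shaped.min()

print(f"logit MSE: closed form {closed_form:.3e}, empirical {empirical:.3e}")
print(f"condition number: raw {kappa_raw:.3e}, shaped {kappa_shaped:.3e}")
```

In a full pipeline, one would presumably train the block codebook under the shaped metric (for example, by running k-means after multiplying keys by a square root of `metric`) and then score held-out logit MSE with the unshaped Σ_q; that step is omitted here because the abstract does not specify it.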
GAURAV SAINI
GAURAV SAINI (Tue,) studied this question.
www.synapsesocial.com/papers/69ddd9e1e195c95cdefd74ac — DOI: https://doi.org/10.5281/zenodo.19537052