What question did this study set out to answer?

This research aims to enhance KV-cache compression by optimizing quantization methods to reduce logit distortion.

April 14, 2026 · Open Access

SMAQ: Workload-Aware Vector Quantization for KV Cache Compression

Key Points

Developed SMAQ to apply log-compressed spectral shaping to query covariance before quantization.
Evaluated on offline TinyLlama-1.1B KV traces with 8D blocks and 256 centroids under a fully adaptive codec.
Conducted cross-architecture validation on Gemma-4-E4B to assess stability across sliding-window and KV-shared attention topologies.
SMAQ achieved a logit MSE reduction of 5.2% to 14.1% across layers compared to standard vector quantization.
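On Qwen3.5-9B, folding the SMAQ metric into TurboQuant's orthogonal codec reached 10.81 perplexity at 4.83× compression, against 10.48 for the uncompressed model.
Cross-architecture validation on Gemma-4-E4B confirmed stability, and appendix experiments on Qwen2.5-0.5B supported the same qualitative conclusion.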
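
The "logit distortion" these points refer to can be written out under a few simplifying assumptions, which are ours and not taken from the paper: the logit is the unscaled dot product between a query q and a key k, the quantized key is written k̂, and Σ_q is taken as the uncentered second moment of the calibration queries. The symbols δ and R are introduced here for illustration only.

```latex
\begin{align}
\delta &= k - \hat{k}, \qquad \Sigma_q = \mathbb{E}\left[ q q^{\top} \right], \\
\mathbb{E}_q\left[ \left( q^{\top} k - q^{\top} \hat{k} \right)^2 \right]
  &= \delta^{\top} \, \mathbb{E}\left[ q q^{\top} \right] \delta
   = \delta^{\top} \Sigma_q \, \delta, \\
\| R \, \delta \|_2^2 &= \delta^{\top} R^{\top} R \, \delta = \| \delta \|_2^2
  \quad \text{for any orthogonal } R .
\end{align}
```

The second line shows why the expected squared logit error is a Mahalanobis-type distance induced by the query covariance. The third line shows that an orthogonal rotation leaves plain Euclidean error unchanged, which is consistent with the claim that orthogonal preprocessing alone cannot lower the optimal distortion once the codebook is learned in the transformed space.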

Abstract

KV-cache compression is increasingly constrained not only by storage budget but by how quantization error perturbs downstream attention logits. Existing preprocessing-based methods typically apply orthogonal transforms before quantization, motivated by improved coordinate marginals. We show that this intuition breaks down for fully adaptive block vector quantizers: once the codebook is learned in the transformed space, orthogonal preprocessing alone does not improve the optimal distortion. We therefore study KV quantization through the downstream logit objective and derive a query-aware metric from the calibration query covariance. While the induced Mahalanobis geometry is optimal in the high-rate limit, it becomes unstable at low bitrate because extreme eigenvalue spreads over-concentrate codebook capacity along a few dominant directions. We address this failure mode with Spectral Metric-Aware Quantization (SMAQ), which applies log-compressed spectral shaping to the query covariance before quantization, preserving task-relevant anisotropy while regularizing the effective condition number. On offline TinyLlama-1.1B KV traces with 8D blocks and 256 centroids, SMAQ reduces held-out logit MSE by 5.2%–14.1% across tested layers, with an 8.3% mean reduction over standard vector quantization, while orthogonal baselines yield no measurable gain under the same adaptive codec. We further validate SMAQ end-to-end on production-scale models via an MLX autoregressive caching framework: on Qwen3.5-9B with offline-calibrated Σ_q, folding the SMAQ metric into TurboQuant's orthogonal codec achieves 10.81 perplexity (vs. 10.48 uncompressed) at 4.83× compression, closing the perplexity gap by 97% relative to prompt-time dynamic calibration. Cross-architecture validation on Gemma-4-E4B confirms stability across sliding-window and KV-shared attention topologies. Additional appendix experiments on synthetic condition-ratio sweeps, an expanded TinyLlama layer study, and Qwen2.5-0.5B support the same qualitative conclusion. We position SMAQ as a calibrated, workload-aware compression method particularly well suited to vertical deployments with stable query distributions.
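
As a rough illustration of the idea in the abstract, the sketch below builds a metric from a query covariance, compresses its spectrum with a logarithm, and runs ordinary k-means in the transformed space. It is not the paper's implementation: the shaping rule `log1p(lam / lam.mean())`, the function names, the synthetic data, and the toy codebook of 16 centroids (the paper reports 256) are all assumptions made for this example.

```python
import numpy as np

def log_shaped_metric(sigma_q, eps=1e-8):
    """Build a transform W whose Euclidean norm equals a log-shaped metric.

    The shaping rule below is a hypothetical stand-in, not the paper's formula.
    """
    lam, U = np.linalg.eigh(sigma_q)           # sigma_q is symmetric PSD
    lam = np.maximum(lam, eps)                 # guard against zero eigenvalues
    shaped = np.log1p(lam / lam.mean())        # assumed log compression of the spectrum
    W = np.sqrt(shaped)[:, None] * U.T         # ||W e||^2 = e^T U diag(shaped) U^T e
    W_inv = U / np.sqrt(shaped)[None, :]       # inverse transform for decoding
    return W, W_inv, lam, shaped

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd iterations; returns a (k, d) codebook."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(1)
        for j in range(k):
            members = x[assign == j]
            if len(members):                   # keep the old centroid if a cell is empty
                centroids[j] = members.mean(0)
    return centroids

def quantize(x, centroids):
    """Replace each row of x by its nearest centroid."""
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids[d2.argmin(1)]

# Synthetic stand-ins for calibration queries and key blocks (8D, as in the abstract).
rng = np.random.default_rng(0)
d = 8
scales = np.logspace(0, -3, d)                 # strongly anisotropic toy query spectrum
queries = rng.normal(size=(1000, d)) * np.sqrt(scales)
keys = rng.normal(size=(1000, d))
sigma_q = queries.T @ queries / len(queries)   # uncentered second moment

W, W_inv, lam, shaped = log_shaped_metric(sigma_q)

plain_hat = quantize(keys, kmeans(keys, 16))   # standard VQ in the original space
z = keys @ W.T                                 # rows are W k_i
shaped_hat = quantize(z, kmeans(z, 16)) @ W_inv.T  # quantize in shaped space, decode back

def logit_mse(k, k_hat):
    """Mean squared error of q^T k over all query/key pairs."""
    return np.mean((queries @ (k - k_hat).T) ** 2)

print("condition number:", lam.max() / lam.min(), "->", shaped.max() / shaped.min())
print("logit MSE, plain VQ :", logit_mse(keys, plain_hat))
print("logit MSE, shaped VQ:", logit_mse(keys, shaped_hat))
```

Running k-means on `W k` is equivalent to clustering the original keys under the shaped metric, so no custom distance function is needed. Because `log1p(x) / x` decreases in `x`, this particular shaping rule can only shrink the ratio between the largest and smallest eigenvalue, which is the regularizing effect the abstract describes. Whether the shaped codebook gives lower logit MSE on a given dataset is an empirical question, and this toy setup does not reproduce the paper's reported numbers.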
