What question did this study set out to answer?

This research aims to develop a new KV cache compression method that treats Keys and Values separately, reducing memory use while preserving attention fidelity.

March 6, 2026 · Open Access

🚀 TTAS-X2: Extreme KV Cache Compression for Large Language Models

Key Points

Introduced a 4-bit scalar quantization for Keys (K)
Implemented hierarchical product quantization with Hadamard transform for Values (V)
Maintained directional sensitivity for K while allowing aggressive compression for V
Tested on Qwen3-32B with 1024-token sequences and analyzed memory use for 4k to 64k contexts
Achieved attention cosine similarity scores ≥ 0.8 across all layers
Attained a compression ratio greater than 10× for KV cache
Demonstrated the capability to run 32B models on devices with only 8GB of RAM
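The two compression paths listed above can be illustrated with a minimal Python sketch. It assumes symmetric signed 4-bit rounding with one scale per head for Keys, and it shows only the first two Value steps named in the methodology (RMS normalization and a Hadamard transform); product quantization and outlier handling are omitted. All function names are hypothetical, and the rounding and scaling rules are assumptions, since the source does not specify them.

```python
import math


def quantize_keys_4bit(head_keys):
    """Quantize one head's key vectors to signed 4-bit codes plus one scale.

    Assumption (not from the source): symmetric rounding to integers in
    [-7, 7], with the scale taken from the largest absolute value in the head.
    """
    max_abs = max(abs(x) for vec in head_keys for x in vec)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [[max(-7, min(7, round(x / scale))) for x in vec] for vec in head_keys]
    return codes, scale


def dequantize_keys(codes, scale):
    """Map 4-bit codes back to approximate key values."""
    return [[c * scale for c in vec] for vec in codes]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def rms_normalize(vec):
    """Divide a value vector by its root-mean-square; return (vector, rms)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec))
    if rms == 0:
        return list(vec), 0.0
    return [x / rms for x in vec], rms


def hadamard_transform(vec):
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = 1.0 / math.sqrt(n)  # scaling makes the transform its own inverse
    return [x * norm for x in out]
```

Applied to a vector with a single large entry, the normalized Hadamard transform spreads that energy evenly across all coordinates, which is presumably why it precedes product quantization. On the Key side, `cosine(original, dequantized)` gives a quick check of how well direction survives the 4-bit rounding.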

Abstract

We introduce TTAS-X2, a novel KV cache compression method that treats Keys (K) and Values (V) with fundamentally different strategies based on their functional roles in attention. While previous approaches compress K and V uniformly, we show that:

K is direction-sensitive and must be preserved with high fidelity
V tolerates aggressive compression without harming attention scores

By applying 4-bit scalar quantization to K and hierarchical product quantization (256 → 64) with Hadamard transform and fixed-budget outliers to V, we achieve:

Attention cosine similarity ≥ 0.8 across all layers of Qwen3-32B
Compression ratio > 10× for KV cache
BPW < 1.6 while maintaining functional behavior

This work enables running 32B-class models on devices with 8 GB RAM, which was previously impossible.

🧠 1. Introduction

KV cache is the main memory bottleneck during long-context inference. Standard quantization (2-3 bits) degrades attention scores severely because K is distorted, and softmax is highly sensitive to directional changes. We propose a functional separation:

K: 4-bit scalar + per-head scale
V: RPQ + Hadamard + fixed-budget outliers

⚙️ 2. Methodology

2.1 K Compression

For each head, we store a 4-bit quantized version with a per-head scale. No Hadamard, no PQ, no residuals: direction is preserved.

2.2 V Compression

RMS normalize per vector
Hadamard transform to spread energy
Two-level product quantization (K1=256, K2=64–512)
Fixed-budget outliers (32–256 per head)

📊 3. Experiments

Model: Qwen3-32B (head dimension 128, 64 layers, GQA with 8 KV heads)
Sequence length: 1024 tokens per test

| Layer | Cosine | Outlier budget | K2 |
|-------|--------|----------------|-----|
| 0 | 0.824 | 32 | 64 |
| 1 | 0.804 | 64 | 128 |
| 2 | 0.803 | 128 | 256 |
| 4 | 0.806 | 256 | 512 |

All four tested layers achieve cosine ≥ 0.8, which we take as evidence that TTAS-X2 generalizes across depth.

💾 4. Memory Analysis

For a 32B model with 64 layers and GQA (8 KV heads):

| Context length | Original (FP16) | TTAS-X2 | Ratio | BPW |
|----------------|-----------------|---------|-------|------|
| 4k | 1.07 GB | 0.17 GB | 6.3× | 1.59 |
| 16k | 4.29 GB | 0.68 GB | 6.3× | 1.59 |
| 32k | 8.59 GB | 1.36 GB | 6.3× | 1.59 |
| 64k | 17.18 GB | 2.72 GB | 6.3× | 1.59 |

TTAS-X2 enables 32k context on 8 GB devices for the first time.

🔬 5. Conclusion

We demonstrate that functional separation of K and V is the key to extreme KV cache compression. TTAS-X2 preserves attention fidelity while reducing memory footprint by an order of magnitude.

Patent pending. All rights reserved.

📎 6. References

Multiverse Computing, "CompactifAI" (2025)
Intel, "KV Cache Compression via Token Pruning" (2024)
Original TTAS work (Alsheck, 2025)

📬 Contact

Abdulaziz Alsheck
📧 abdulazizabdullahalalsheck@gmail.com
Phone Number +966560756695
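The FP16 column of the memory table in Section 4 follows directly from the stated model shape (64 layers, 8 KV heads, head dimension 128). The sketch below recomputes it, assuming decimal gigabytes (1 GB = 10^9 bytes) and context lengths that are multiples of 1024 tokens. The function names are hypothetical.

```python
def kv_cache_bytes(tokens, layers=64, kv_heads=8, head_dim=128, bits_per_value=16.0):
    """Bytes needed to cache Keys and Values for `tokens` positions."""
    values = tokens * layers * kv_heads * head_dim * 2  # factor 2: one K and one V
    return values * bits_per_value / 8


def compression_ratio(original_gb, compressed_gb):
    """Size reduction factor between two cache sizes."""
    return original_gb / compressed_gb


if __name__ == "__main__":
    # Assumption: "4k" means 4096 tokens, "16k" means 16384, and so on.
    for label, tokens in [("4k", 4096), ("16k", 16384), ("32k", 32768), ("64k", 65536)]:
        fp16_gb = kv_cache_bytes(tokens) / 1e9
        print(f"{label}: {fp16_gb:.2f} GB in FP16")
```

Under these assumptions the script prints 1.07, 4.29, 8.59, and 17.18 GB, matching the table, and `compression_ratio(1.07, 0.17)` gives the reported 6.3×. A 6.3× reduction from 16-bit storage corresponds to roughly 2.5 bits per cached value on average. The source does not state how its BPW figure is defined, so the sketch does not attempt to reproduce it.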

Authors

Abdulaziz Alsheck
