We introduce TTAS-X2, a novel KV cache compression method that treats Keys (K) and Values (V) with fundamentally different strategies based on their functional roles in attention. While previous approaches compress K and V uniformly, we show that:

- K is direction‑sensitive and must be preserved with high fidelity
- V tolerates aggressive compression without harming attention scores

By applying 4‑bit scalar quantization to K and hierarchical product quantization (256 → 64) with a Hadamard transform and fixed‑budget outliers to V, we achieve:

- Attention cosine similarity ≥ 0.8 across all layers of Qwen3-32B
- Compression ratio > 10× for the KV cache
- BPW < 1.6 while maintaining functional behavior

This work enables running 32B‑class models on devices with 8GB RAM, which was previously impossible.

🧠 1. Introduction

The KV cache is the main memory bottleneck during long‑context inference. Standard quantization (2‑3 bits) degrades attention scores severely because K is distorted and softmax is highly sensitive to directional changes. We propose a functional separation:

- K: 4‑bit scalar + per‑head scale
- V: RPQ + Hadamard + fixed‑budget outliers

⚙️ 2. Methodology

2.1 K Compression

For each head, we store a 4‑bit quantized version of K with a per‑head scale. No Hadamard transform, no PQ, and no residuals are used, so direction is preserved.

2.2 V Compression

- RMS‑normalize each vector
- Apply a Hadamard transform to spread energy
- Apply two‑level product quantization (K1 = 256, K2 = 64–512)
- Keep fixed‑budget outliers (32–256 per head)

📊 3. Experiments

Model: Qwen3-32B (head dimension = 128, 64 layers, GQA with 8 KV heads)
Sequence length: 1024 tokens per test

| Layer | Cosine | Outlier budget | K2 |
|---|---|---|---|
| 0 | 0.824 | 32 | 64 |
| 1 | 0.804 | 64 | 128 |
| 2 | 0.803 | 128 | 256 |
| 4 | 0.806 | 256 | 512 |

All four reported layers achieve cosine ≥ 0.8, indicating that TTAS-X2 generalizes across depth.

💾 4. Memory Analysis

For a 32B model with 64 layers and GQA (8 KV heads):

| Context length | Original (FP16) | TTAS-X2 | Ratio | BPW |
|---|---|---|---|---|
| 4k | 1.07 GB | 0.17 GB | 6.3× | 1.59 |
| 16k | 4.29 GB | 0.68 GB | 6.3× | 1.59 |
| 32k | 8.59 GB | 1.36 GB | 6.3× | 1.59 |
| 64k | 17.18 GB | 2.72 GB | 6.3× | 1.59 |

TTAS-X2 enables 32k context on 8GB devices for the first time.

🔬 5. Conclusion

We demonstrate that functional separation of K and V is the key to extreme KV cache compression. TTAS-X2 preserves attention fidelity while reducing memory footprint by an order of magnitude.

Patent pending; all rights reserved.

📎 6. References

- Multiverse Computing, "CompactifAI" (2025)
- Intel, "KV Cache Compression via Token Pruning" (2024)
- Original TTAS work (Alsheck, 2025)

📬 Contact

Abdulaziz Alsheck
📧 abdulazizabdullahalalsheck@gmail.com
Phone: +966560756695
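As an illustration of the K path in Section 2.1, the following Python sketch quantizes one head's keys to 4‑bit integers with a single per‑head scale and checks how well direction is preserved. The symmetric abs‑max scale rule, the function names (`quantize_k_head`, `dequantize_k_head`, `cosine`), and the random Gaussian test data are assumptions made for illustration; the text above specifies only 4‑bit scalar quantization with a per‑head scale.

```python
import math
import random


def quantize_k_head(k_head, bits=4):
    """Symmetric scalar quantization of one head's keys with one shared scale.

    k_head: list of key vectors (lists of floats) belonging to a single head.
    Returns (codes, scale); codes are ints in [-2**(bits-1), 2**(bits-1) - 1].
    ASSUMPTION: the scale is derived from the head's absolute maximum.
    """
    qmax = 2 ** (bits - 1) - 1   # 7 for 4 bits
    qmin = -(2 ** (bits - 1))    # -8 for 4 bits
    absmax = max(abs(x) for vec in k_head for x in vec)
    scale = absmax / qmax if absmax > 0 else 1.0
    codes = [
        [max(qmin, min(qmax, round(x / scale))) for x in vec]
        for vec in k_head
    ]
    return codes, scale


def dequantize_k_head(codes, scale):
    """Map integer codes back to floats using the per-head scale."""
    return [[c * scale for c in vec] for vec in codes]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


# Hypothetical data: 64 tokens, head dimension 128, standard normal entries.
rng = random.Random(0)
k_head = [[rng.gauss(0.0, 1.0) for _ in range(128)] for _ in range(64)]

codes, scale = quantize_k_head(k_head)
k_hat = dequantize_k_head(codes, scale)
cosines = [cosine(k, kh) for k, kh in zip(k_head, k_hat)]
print(f"min cosine over tokens: {min(cosines):.4f}")
```

On real key tensors the achieved cosine depends on the value distribution; under this abs‑max rule, a head with a few very large entries would spend much of its 4‑bit range on them.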
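The first two V steps in Section 2.2 (RMS normalization and a Hadamard transform) can be sketched as follows. This is not the TTAS-X2 implementation: the orthonormal 1/sqrt(n) scaling, the natural‑order fast Walsh‑Hadamard butterfly, and the names `rms_normalize` and `fwht` are assumptions, and the product‑quantization and outlier stages are omitted.

```python
import math


def rms_normalize(vec):
    """Divide a vector by its root-mean-square; return (normalized, rms)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec))
    if rms == 0.0:
        return list(vec), 0.0
    return [x / rms for x in vec], rms


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of 2).

    With the 1/sqrt(n) scaling the transform preserves the vector norm and
    is its own inverse.
    """
    n = len(vec)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    out = list(vec)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    s = 1.0 / math.sqrt(n)
    return [x * s for x in out]


# Hypothetical value vector with all of its energy in one coordinate.
spike = [0.0] * 128
spike[5] = 4.0

normed, rms = rms_normalize(spike)
spread = fwht(normed)

# After the transform, every coordinate carries the same magnitude.
print(f"max |x| before: {max(abs(x) for x in normed):.3f}")
print(f"max |x| after:  {max(abs(x) for x in spread):.3f}")
```

Flattening the magnitude profile in this way is the usual motivation for rotating a vector before codebook quantization, since no single coordinate dominates the error.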
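The FP16 column in Section 4 follows from the model shape given in Section 3. The short calculation below reproduces it, assuming that "4k" means 4096 tokens and that 1 GB = 10^9 bytes (both assumptions, chosen because they match the reported figures). The TTAS-X2 sizes are copied from the table rather than derived, and the function name `fp16_kv_cache_bytes` is hypothetical.

```python
def fp16_kv_cache_bytes(tokens, layers=64, kv_heads=8, head_dim=128):
    """Bytes needed for an uncompressed FP16 KV cache of the given length."""
    bytes_per_value = 2  # FP16
    tensors = 2          # one K and one V tensor per layer
    return tensors * layers * kv_heads * head_dim * bytes_per_value * tokens


# ASSUMPTION: "4k" = 4096 tokens, and so on for the other rows.
context_tokens = {"4k": 4096, "16k": 16384, "32k": 32768, "64k": 65536}

# Reported TTAS-X2 sizes in GB, copied from the Section 4 table.
reported_gb = {"4k": 0.17, "16k": 0.68, "32k": 1.36, "64k": 2.72}

rows = []
for label, tokens in context_tokens.items():
    original_gb = fp16_kv_cache_bytes(tokens) / 1e9
    ratio = original_gb / reported_gb[label]
    rows.append((label, round(original_gb, 2), round(ratio, 1)))
    print(f"{label}: {original_gb:.2f} GB FP16, ratio {ratio:.1f}x")

# Prints:
# 4k: 1.07 GB FP16, ratio 6.3x
# 16k: 4.29 GB FP16, ratio 6.3x
# 32k: 8.59 GB FP16, ratio 6.3x
# 64k: 17.18 GB FP16, ratio 6.3x
```

Each token costs 262,144 bytes (256 KiB) under these assumptions, which is why the FP16 cache reaches 8.59 GB at 32k tokens.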
alsheck abdulaziz
alsheck abdulaziz (Wed,) studied this question.
www.synapsesocial.com/papers/69aa70f8531e4c4a9ff5b3aa — DOI: https://doi.org/10.5281/zenodo.18858709