What question did this study set out to answer?

The aim is to establish a theoretical framework, the Think-Answer Quantization Gap, for optimizing KV cache quantization in large reasoning models.

April 12, 2026 | Open Access

Think Less, Store Smarter: A Theoretical Framework for Type-Aware KV Cache Quantization in Large Reasoning Models

Key Points

Introduced the Think-Answer Quantization Gap (TAQG) framework.
Proved the suboptimality of uniform KV cache quantization under certain conditions.
Validated the framework using the DeepSeek-R1-Distill-Qwen-1.5B model.
Found that answer-phase tokens showed higher cosine redundancy than think-phase tokens in the tested model.
Observed a model-size-dependent reversal in token redundancy compared to findings on the larger 671B model.

Abstract

This paper introduces the Think-Answer Quantization Gap (TAQG), a theoretical framework showing that uniform KV cache quantization is provably suboptimal for large reasoning models whenever think-phase and answer-phase tokens differ in pairwise cosine redundancy. The framework is direction-agnostic: it prescribes fewer bits for whichever phase exhibits higher redundancy. Empirical validation on DeepSeek-R1-Distill-Qwen-1.5B reveals a surprising model-size-dependent redundancy reversal: answer-phase tokens exhibit higher redundancy than think-phase tokens, the opposite of findings on the full 671B model. Code and experimental data are included.
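The direction-agnostic rule in the abstract can be illustrated with a minimal sketch. This is not the paper's code: the function names, the toy vectors, and the 2-bit and 4-bit widths are assumptions chosen for illustration, and mean pairwise cosine similarity is used here as one plausible reading of "pairwise cosine redundancy".

```python
import math
from itertools import combinations


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors.

    Assumes at least two vectors, all nonzero and of equal length.
    """
    sims = []
    for u, v in combinations(vectors, 2):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        sims.append(dot / (norm_u * norm_v))
    return sum(sims) / len(sims)


def assign_bits(think_vectors, answer_vectors, low_bits=2, high_bits=4):
    """Hypothetical allocation rule: the more redundant phase gets fewer bits.

    The bit widths are illustrative defaults, not values from the paper.
    """
    r_think = mean_pairwise_cosine(think_vectors)
    r_answer = mean_pairwise_cosine(answer_vectors)
    if r_think == r_answer:
        # No redundancy gap: keep a uniform bit width for both phases.
        return {"think": high_bits, "answer": high_bits}
    if r_think > r_answer:
        return {"think": low_bits, "answer": high_bits}
    return {"think": high_bits, "answer": low_bits}


# Toy key vectors (made up): the answer-phase vectors point in nearly the
# same direction, so they are the more redundant phase in this example.
think = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
answer = [[1.0, 0.1], [1.0, 0.2], [1.0, 0.0]]
allocation = assign_bits(think, answer)
```

With these toy inputs the answer phase receives the smaller bit width, which mirrors the direction reported for the 1.5B model. Swapping the two inputs flips the allocation, which is what direction-agnostic means here.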

Cite This Study

Raviteja Nekkalapu (Fri,) studied this question.

synapsesocial.com/papers/69db36c24fe01fead37c4cba https://doi.org/10.5281/zenodo.19500603
