This paper introduces the Think-Answer Quantization Gap (TAQG), a theoretical framework showing that uniform KV cache quantization is provably suboptimal for large reasoning models whenever think-phase and answer-phase tokens differ in pairwise cosine redundancy. The framework is direction-agnostic: it prescribes fewer bits for whichever phase exhibits higher redundancy. Empirical validation on DeepSeek-R1-Distill-Qwen-1.5B reveals a surprising model-size-dependent redundancy reversal: answer-phase tokens exhibit higher redundancy than think-phase tokens, the opposite of findings reported for the full 671B model. Code and experimental data are included.
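The direction-agnostic rule can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes that "pairwise cosine redundancy" means the mean cosine similarity over all distinct pairs of cached vectors in a phase, which the abstract does not spell out. The function names `pairwise_cosine_redundancy` and `allocate_bits`, and the bit widths 2 and 4, are hypothetical choices made for illustration.

```python
import numpy as np


def pairwise_cosine_redundancy(vectors: np.ndarray) -> float:
    """Mean cosine similarity over all distinct row pairs.

    Assumed definition for illustration; the paper may define it differently.
    Expects a 2-D array with at least two rows.
    """
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # guard against zero vectors
    sim = unit @ unit.T
    n = sim.shape[0]
    # Drop the diagonal (self-similarity) before averaging.
    return float((sim.sum() - np.trace(sim)) / (n * (n - 1)))


def allocate_bits(think: np.ndarray, answer: np.ndarray,
                  low_bits: int = 2, high_bits: int = 4) -> dict:
    """Give fewer bits to whichever phase is more redundant.

    The bit widths are hypothetical placeholders. Equal redundancy falls
    back to a uniform allocation, which is also an assumption here.
    """
    r_think = pairwise_cosine_redundancy(think)
    r_answer = pairwise_cosine_redundancy(answer)
    if r_think > r_answer:
        return {"think": low_bits, "answer": high_bits}
    if r_answer > r_think:
        return {"think": high_bits, "answer": low_bits}
    return {"think": high_bits, "answer": high_bits}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for cached key vectors, not real model data.
    think = rng.normal(size=(32, 64))            # nearly unrelated directions
    base = rng.normal(size=(1, 64))
    answer = base + 0.1 * rng.normal(size=(32, 64))  # clustered around one direction
    print(allocate_bits(think, answer))
```

In this synthetic setup the answer-phase vectors are the more redundant ones, so they receive the lower bit width. That is the same direction the abstract reports for the 1.5B model. Swapping the two inputs flips the allocation, which is what direction-agnostic means here.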
Raviteja Nekkalapu
Raviteja Nekkalapu (Fri,) studied this question.
www.synapsesocial.com/papers/69db36c24fe01fead37c4cba — DOI: https://doi.org/10.5281/zenodo.19500603