As LLMs scale toward million-token contexts, KV cache memory becomes the dominant bottleneck. Existing pruning methods such as Top-K eviction discard tokens based on their current attention scores, implicitly assuming that a token scoring low now will remain unimportant; this leads to unpredictable reconstruction failures at structurally important positions. This paper proposes the SRC (Selection-Reconstruction-Compression) pipeline, which summarizes tokens rather than discarding them. Low-salience, high-entropy tokens are routed to a Recycle Bin, reconstructed via ordinary least squares (OLS) against the current query matrix, and compressed into compact centroid tokens using singular value decomposition (SVD). Experiments show HAE achieves up to 3× lower reconstruction error than Top-K at a 30% keep ratio while using less total memory.
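The abstract gives no implementation details, so the following is only a minimal NumPy sketch of one way a select, reconstruct, compress flow could look; it is not the paper's code. Everything in it is an assumption made for illustration: the function name `src_sketch`, salience defined as mean attention weight over the current queries, centroid keys taken from a truncated SVD of the recycled keys, and centroid values fitted by OLS so that the attention output over the current queries is matched as closely as possible. The entropy criterion mentioned above is omitted, and the sketch assumes at least one token is recycled.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Standard scaled dot-product attention for a single head.
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def src_sketch(Q, K, V, keep_ratio=0.3, rank=4):
    """Hypothetical select / compress / reconstruct sketch (not the paper's method)."""
    n, d = K.shape
    target = attention(Q, K, V)  # output with the full, unpruned cache

    # Selection (assumed rule): salience = mean attention weight per token.
    salience = softmax(Q @ K.T / np.sqrt(d)).mean(axis=0)
    n_keep = max(1, int(round(keep_ratio * n)))
    order = np.argsort(-salience)
    keep, recycle = order[:n_keep], order[n_keep:]
    K_keep, V_keep = K[keep], V[keep]

    # Compression (assumed rule): truncated SVD of the recycled keys
    # yields r "centroid" keys spanning their dominant directions.
    _, S, Vt = np.linalg.svd(K[recycle], full_matrices=False)
    r = min(rank, len(S))
    C_k = S[:r, None] * Vt[:r]

    # Reconstruction (assumed rule): OLS fit of centroid values so that
    # kept tokens plus centroids reproduce the full attention output.
    W = softmax(Q @ np.vstack([K_keep, C_k]).T / np.sqrt(d))
    W_keep, W_c = W[:, :n_keep], W[:, n_keep:]
    C_v, *_ = np.linalg.lstsq(W_c, target - W_keep @ V_keep, rcond=None)

    return {
        "n_keep": n_keep,
        "centroid_keys": C_k,
        "centroid_values": C_v,
        # Frobenius-norm errors of the attention output on these queries.
        "err_topk": np.linalg.norm(attention(Q, K_keep, V_keep) - target),
        "err_src": np.linalg.norm(W_keep @ V_keep + W_c @ C_v - target),
        "err_zero_centroids": np.linalg.norm(W_keep @ V_keep - target),
    }

rng = np.random.default_rng(0)
Q = rng.normal(size=(16, 8))
K = rng.normal(size=(64, 8))
V = rng.normal(size=(64, 8))
out = src_sketch(Q, K, V)
print(out["n_keep"], out["err_topk"], out["err_src"])
```

In this sketch the cache shrinks from 64 tokens to 19 kept tokens plus 4 centroids. Because the centroid values come from a least-squares fit, the summarized error can never exceed the error obtained with all-zero centroid values on the same queries; how it compares with plain Top-K depends on the data, and random inputs like these say nothing about the paper's reported results.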
Jayanth Chandra studied this question.
synapsesocial.com/papers/69e866896e0dea528ddeaeed — DOI: https://doi.org/10.5281/zenodo.19657329
Jayanth Chandra