The key-value (KV) cache that underpins autoregressive transformer inference grows linearly with sequence length and dominates GPU memory during long-context generation. This paper introduces KVScope, an instrumentation framework that records per-layer KV tensor shapes via PyTorch forward hooks and correlates them with hardware memory telemetry from the NVIDIA Management Library (NVML). We profile four transformer architectures on a single H100 80 GB device: Pythia-1.4B (multi-head attention baseline), Gemma 4 (grouped-query attention with local/global layer interleaving), GLM-4.7-Flash (mixture of experts), and gpt-oss-120B (sliding/full attention hybrid). Three findings emerge. First, Gemma 4 systematically retains between 4.7 and 5.3 GB of KV cache after every generation (mean leak score 0.48, n=15), and standard fragmentation heuristics do not detect this retention. Second, the per-layer footprint of gpt-oss-120B is strongly bimodal (coefficient of variation 0.94), producing a 14.5 GiB gap between the PyTorch reserved and allocated pools. Third, 8-bit weight quantisation raises perplexity by less than 0.25% for the smaller models but by 4.6% for Gemma 4. The profiler and dataset are packaged for reproduction.
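The linear growth of the KV cache and the coefficient of variation used to describe per-layer footprints can be illustrated with a minimal sketch. It is not taken from KVScope: the function names, the model configuration, and the per-layer footprint lists are all hypothetical, and the coefficient of variation is assumed to be the population standard deviation divided by the mean, which the abstract does not specify.

```python
from statistics import mean, pstdev


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_element=2):
    """Bytes needed to hold keys and values for every layer.

    The leading factor of 2 counts one key tensor and one value tensor
    per layer; bytes_per_element=2 corresponds to 16-bit storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (assumed definition)."""
    return pstdev(values) / mean(values)


# Illustrative configuration (assumed values, not reported in the paper).
config = dict(num_layers=24, num_kv_heads=16, head_dim=128)

# Doubling the sequence length doubles the cache size.
for seq_len in (1024, 2048, 4096):
    size = kv_cache_bytes(seq_len=seq_len, **config)
    print(f"{seq_len:>5} tokens: {size / 2**20:.0f} MiB")

# Hypothetical per-layer footprints in MiB: identical layers versus
# layers that alternate between a small and a large cache.
uniform = [8.0] * 8
bimodal = [1.0, 16.0] * 4
print("uniform CV:", coefficient_of_variation(uniform))
print("bimodal CV:", coefficient_of_variation(bimodal))
```

Under these assumptions a model with identical layers has a coefficient of variation of zero, while a model whose layers alternate between small and large caches has a value approaching one, which is the kind of spread the bimodal finding describes.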
Rahul Surya
University of Edinburgh
synapsesocial.com/papers/69f4443a967e944ac55673cb — DOI: https://doi.org/10.5281/zenodo.19871038