The key-value (KV) cache that underpins autoregressive transformer inference grows linearly with sequence length and dominates GPU memory during long-context generation. This paper introduces KVScope, an instrumentation framework that records per-layer KV tensor shapes via PyTorch forward hooks and correlates them with hardware memory telemetry from the NVIDIA Management Library. We profile four transformer architectures on a single H100 80 GB device: Pythia-1.4B (multi-head attention baseline), Gemma 4 (grouped-query attention with local/global layer interleaving), GLM-4.7-Flash (mixture of experts), and gpt-oss-120B (hybrid of sliding-window and full attention). Three findings emerge. First, Gemma 4 systematically retains between 4.7 and 5.3 GB of KV cache after every generation (mean leak score 0.48, n=15), a retention that standard fragmentation heuristics do not detect. Second, the per-layer footprint of gpt-oss-120B is strongly bimodal (coefficient of variation 0.94), producing a 14.5 GiB gap between the PyTorch reserved and allocated pools. Third, 8-bit weight quantisation raises perplexity by less than 0.25% for the smaller models but by 4.6% for Gemma 4. The profiler and dataset are packaged for reproduction.
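The linear growth of the KV cache and the coefficient of variation used to describe per-layer footprints can be illustrated with a minimal sketch. It is not KVScope code: the function names, the configuration values in the usage lines, and the choice of population standard deviation are assumptions made for illustration, and none of the numbers correspond to the profiled models.

```python
from statistics import mean, pstdev


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate the KV cache size for a dense-attention decoder.

    The leading factor 2 accounts for one key and one value tensor per layer.
    bytes_per_elem=2 assumes fp16/bf16 storage (an assumption, not a measured value).
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def coefficient_of_variation(per_layer_bytes):
    """Standard deviation divided by mean of the per-layer footprints.

    Uses the population standard deviation; the paper does not state
    which estimator it uses, so this choice is an assumption.
    """
    mu = mean(per_layer_bytes)
    if mu == 0:
        raise ValueError("mean footprint is zero; CV is undefined")
    return pstdev(per_layer_bytes) / mu


# Hypothetical configuration: doubling the sequence length doubles the cache.
short = kv_cache_bytes(num_layers=24, num_kv_heads=16, head_dim=128, seq_len=4096)
long = kv_cache_bytes(num_layers=24, num_kv_heads=16, head_dim=128, seq_len=8192)
print(long / short)

# Hypothetical per-layer footprints: alternating small and large layers
# yield a much higher CV than uniform layers.
print(coefficient_of_variation([1, 1, 1, 1]))
print(coefficient_of_variation([1, 9, 1, 9]))
```

Under grouped-query attention, `num_kv_heads` is smaller than the number of query heads, which shrinks the estimate proportionally. Layers that cache only a sliding window would cap `seq_len` at the window size, which is one way a bimodal per-layer distribution can arise.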
Rahul Surya (Wed,) studied this question.