What question did this study set out to answer?

This research aims to analyze the memory dynamics of key-value caches in various transformer architectures on NVIDIA H100 GPUs.

May 1, 2026 | Open Access

KVScope: Profiling Cross-Architecture KV-Cache Dynamics on NVIDIA H100

Key Points

Profiles four transformer architectures on an H100 80 GB device using KVScope.
Records per-layer KV tensor shapes with PyTorch forward hooks and correlates them with hardware memory telemetry from the NVIDIA Management Library (NVML).
Monitors memory behavior across different settings and configurations during autoregressive inference.
Gemma 4 retains 4.7 to 5.3 GB of KV cache after each generation, with a mean leak score of 0.48 (n=15).
gpt-oss-120B shows a bimodal per-layer footprint with a 14.5 GiB difference between reserved and allocated PyTorch pools.
8-bit weight quantization raises perplexity by less than 0.25% for the smaller models but by 4.6% for Gemma 4.
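
To make the per-layer footprint and coefficient of variation mentioned above concrete, here is a minimal sketch of how recorded KV tensor shapes could be turned into byte counts and summarized. It is not KVScope's code: the function names, the (batch, kv_heads, seq_len, head_dim) layout, the window size of 512, and the 2-byte element size are all assumptions chosen for illustration, and the numbers are not taken from the paper.

```python
from math import prod
from statistics import mean, pstdev


def kv_bytes(shape, bytes_per_element=2):
    """Bytes held by one key (or value) tensor of the given shape."""
    return prod(shape) * bytes_per_element


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return pstdev(values) / mean(values)


# Hypothetical shapes, laid out as (batch, kv_heads, seq_len, head_dim).
# Sliding-window layers cap seq_len at the window; full layers keep every token.
sliding_layer = (1, 8, 512, 64)
full_layer = (1, 8, 4096, 64)
layer_shapes = [sliding_layer, full_layer] * 4  # alternating hybrid stack

# Each layer stores one key tensor and one value tensor of the same shape.
per_layer_bytes = [2 * kv_bytes(shape) for shape in layer_shapes]
cv = coefficient_of_variation(per_layer_bytes)

print(f"per-layer MiB: {[b / 2**20 for b in per_layer_bytes]}")
print(f"coefficient of variation: {cv:.3f}")
```

In this toy stack the layers fall into two size classes, so the per-layer distribution is bimodal and the coefficient of variation is large. A uniform stack, where every layer caches the same number of tokens, would give a value of zero.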

Abstract

The key-value (KV) cache that underpins autoregressive transformer inference grows linearly with sequence length and dominates GPU memory during long-context generation. This paper introduces KVScope, an instrumentation framework that records per-layer KV tensor shapes via PyTorch forward hooks and correlates them with hardware memory telemetry from the NVIDIA Management Library. We profile four transformer architectures on a single H100 80 GB device: Pythia-1.4B (multi-head attention baseline), Gemma 4 (grouped-query attention with local/global layer interleaving), GLM-4.7-Flash (mixture of experts), and gpt-oss-120B (sliding/full hybrid). Three findings emerge. First, Gemma 4 systematically retains between 4.7 and 5.3 GB of KV cache after every generation (mean leak score 0.48, n=15), invisible to standard fragmentation heuristics. Second, the per-layer footprint of gpt-oss-120B is strongly bimodal (coefficient of variation 0.94), producing a 14.5 GiB gap between PyTorch reserved and allocated pools. Third, 8-bit weight quantisation costs less than 0.25% perplexity for smaller models but +4.6% for Gemma 4. The profiler and dataset are packaged for reproduction.
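
The linear growth described in the abstract follows from the standard size estimate for a full-attention KV cache: two tensors (keys and values) per layer, each holding one head_dim vector per KV head per token. The sketch below works through that arithmetic. The configuration values are hypothetical rather than taken from the paper, and the formula assumes every layer caches all tokens, which does not hold for sliding-window or local layers.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_element=2):
    """Estimate KV-cache size when every layer caches every token.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return (2 * num_layers * batch_size * num_kv_heads
            * head_dim * seq_len * bytes_per_element)


# Hypothetical multi-head attention model in 16-bit precision.
config = dict(num_layers=24, num_kv_heads=16, head_dim=128)

per_token = kv_cache_bytes(seq_len=1, **config)
at_8k = kv_cache_bytes(seq_len=8192, **config)
at_16k = kv_cache_bytes(seq_len=16384, **config)

print(f"per token: {per_token / 2**10:.0f} KiB")
print(f"8192 tokens: {at_8k / 2**30:.2f} GiB")
print(f"16384 tokens: {at_16k / 2**30:.2f} GiB")
```

Doubling the sequence length doubles the estimate. Grouped-query attention lowers num_kv_heads relative to the number of query heads, which shrinks the cache by the same ratio.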

Authors

Rahul Surya

University of Edinburgh
