May 31, 2026 · Open Access

KVScope V2: Cross-Architecture KV-Cache Profiling on NVIDIA H200

Key Points

Characterizes how KV-cache memory footprint varies across model architectures using KVScope V2.
Profiles five model families spanning four attention paradigms on an NVIDIA H200.
Evaluates all models under a standardized 2048-token generation budget (1024 for Pythia-1.4B).
Measures KV-cache density and per-layer heterogeneity for each model.
KV-cache density spans a 13× range, from 1.12 to 14.67 KB/token/layer.
Gemma 4 has the highest cache density, 14.67 KB/token/layer, with a per-layer coefficient of variation (CV) of 0.203.
gpt-oss shows extreme layer heterogeneity (CV = 0.902), so a uniform paged allocator wastes 47% of allocated memory on this architecture.
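
To make the density unit concrete, here is a minimal sketch of the textbook per-token, per-layer KV-cache size for a standard full-attention layer: two tensors (K and V), each holding `num_kv_heads × head_dim` elements. The function name and both configurations are hypothetical illustrations, not values from the study, and the sketch assumes 1 KB = 1024 bytes and 16-bit storage. Hybrid, sliding-window, or state-space layers would not follow this formula directly.

```python
def kv_bytes_per_token_per_layer(num_kv_heads: int, head_dim: int,
                                 bytes_per_elem: int = 2) -> int:
    """Bytes of K plus V stored per token in one full-attention layer."""
    # Factor of 2 accounts for storing both the key and the value tensor.
    return 2 * num_kv_heads * head_dim * bytes_per_elem


# Hypothetical configurations (assumptions, not taken from the study).
mha_bytes = kv_bytes_per_token_per_layer(num_kv_heads=32, head_dim=128)  # multi-head
gqa_bytes = kv_bytes_per_token_per_layer(num_kv_heads=8, head_dim=128)   # grouped-query

mha_kib = mha_bytes / 1024  # density in KB/token/layer, assuming 1 KB = 1024 B
gqa_kib = gqa_bytes / 1024

print(f"MHA: {mha_kib:.2f} KB/token/layer, GQA: {gqa_kib:.2f} KB/token/layer")
```

Under these assumptions, cutting the number of KV heads from 32 to 8 reduces the density fourfold, which is the kind of architecture-determined difference the key points describe.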

Abstract

Key–value (KV) caches account for the dominant growth term in autoregressive inference memory, yet per-architecture footprint fingerprints remain absent from most model cards. We apply KVScope V2 to five model families spanning four attention paradigms on an NVIDIA H200 (143,771 MiB VRAM, Hopper architecture). All models are profiled under a standardised 2048-token generation budget (1024 for Pythia-1.4B). Three results emerge: (1) KV-cache density is a stable, architecture-determined fingerprint spanning a 13× range (1.12 to 14.67 KB/token/layer); (2) Gemma 4's local/global hybrid produces the highest density at 14.67 KB/token/layer with a per-layer coefficient of variation of 0.203; (3) gpt-oss exhibits extreme layer heterogeneity (CV = 0.902), such that a uniform paged allocator wastes 47% of allocated memory on this architecture. Nemotron-H is retained as a state-space-model control, confirming an O(1) memory trajectory.
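
The two heterogeneity statistics in the abstract can be illustrated with a short sketch. The abstract does not state how waste is computed, so this sketch assumes one plausible definition: every layer is allotted as much memory as the largest layer, and waste is the unused fraction. It also assumes the CV uses the population standard deviation. The per-layer figures and function names are hypothetical, not measurements from the study.

```python
import statistics


def coefficient_of_variation(per_layer_kb: list[float]) -> float:
    """Population standard deviation divided by the mean (an assumed definition)."""
    return statistics.pstdev(per_layer_kb) / statistics.fmean(per_layer_kb)


def uniform_allocation_waste(per_layer_kb: list[float]) -> float:
    """Fraction of memory unused when each layer is allotted the largest layer's size."""
    allocated = max(per_layer_kb) * len(per_layer_kb)
    used = sum(per_layer_kb)
    return 1.0 - used / allocated


# Hypothetical per-layer densities in KB/token (not data from the study):
# alternating large and small layers, as in a local/global hybrid.
layers = [8.0, 2.0, 8.0, 2.0]

cv = coefficient_of_variation(layers)
waste = uniform_allocation_waste(layers)
print(f"CV = {cv:.3f}, waste under uniform allocation = {waste:.1%}")
```

For this toy input the CV is 0.6 and the assumed waste is 37.5%. A model with identical layers would give a CV of 0 and no waste, which is why a high CV signals a poor fit for uniform paged allocation.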

Authors

Rahul Surya

University of Edinburgh

Institutions

University of Edinburgh