Key–value (KV) caches account for the dominant growth term in autoregressive inference memory, yet per-architecture footprint fingerprints remain absent from most model cards. We apply KVScope V2 to five model families spanning four attention paradigms on an NVIDIA H200 (143,771 MiB VRAM, Hopper architecture). All models are profiled under a standardised 2048-token generation budget (1024 for Pythia-1.4B). Three results emerge: (1) KV-cache density is a stable, architecture-determined fingerprint spanning a 13× range (1.12 to 14.67 KB/token/layer); (2) Gemma 4's local/global hybrid produces the highest density at 14.67 KB/token/layer with a per-layer coefficient of variation of 0.203; (3) gpt-oss exhibits extreme layer heterogeneity (CV = 0.902), such that a uniform paged allocator wastes 47% of allocated memory on this architecture. Nemotron-H is retained as a state-space-model control confirming O(1) memory trajectory.
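The quantities named above (per-layer density, coefficient of variation, and waste under a uniform allocator) can be illustrated with a minimal sketch. It is not KVScope V2's method, which this abstract does not specify. It assumes that density is the size of one key and one value vector per KV head, that CV is the population standard deviation over the mean, and that a uniform allocator reserves the largest layer's size for every layer. The function names and the four-layer configuration are hypothetical.

```python
from statistics import mean, pstdev


def kv_bytes_per_token(n_kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes one layer stores per token: a key and a value vector per KV head.

    Assumes a plain (uncompressed) cache; bytes_per_elem=2 corresponds to fp16/bf16.
    """
    return 2 * n_kv_heads * head_dim * bytes_per_elem


def coefficient_of_variation(per_layer: list[float]) -> float:
    """Population standard deviation divided by the mean of per-layer densities."""
    return pstdev(per_layer) / mean(per_layer)


def uniform_allocation_waste(per_layer: list[float]) -> float:
    """Fraction of reserved memory left unused when every layer is given
    the largest layer's per-token size (an assumed allocator model)."""
    return 1.0 - mean(per_layer) / max(per_layer)


# Hypothetical 4-layer model: two layers with 8 KV heads, two with 2 KV heads.
layers = [
    kv_bytes_per_token(8, 128),
    kv_bytes_per_token(8, 128),
    kv_bytes_per_token(2, 128),
    kv_bytes_per_token(2, 128),
]

cv = coefficient_of_variation(layers)
waste = uniform_allocation_waste(layers)
print(layers)  # [4096, 4096, 1024, 1024]
print(cv)      # 0.6
print(waste)   # 0.375
```

Under these assumptions the waste fraction depends only on the ratio of mean to maximum per-layer density, so a model with a few large layers and many small ones is penalised more heavily than its CV alone would suggest.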
Rahul Surya
University of Edinburgh
synapsesocial.com/papers/6a1bd2675783ba022b6fddb8 — DOI: https://doi.org/10.5281/zenodo.20445398