We present CoDA-GQA-L, an attention mechanism that provably bounds per-layer KV cache memory to O(W+Me+Ms) independent of sequence length while retaining selective long-range context through dual memory banks. The architecture combines three innovations: (1) Constrained Orthogonal Differential Attention (CoDA), which sharpens attention by subtracting a gated inhibitory stream produced via learnable orthogonal rotation of the signal query, eliminating the second query projection required by prior differential attention; (2) a dual-bank bounded memory comprising an exact landmark bank for high-fidelity token retention and an EMA summary bank for thematic compression; and (3) value-routed semantic matching that ensures position-invariant memory updates despite RoPE-at-write key storage. A two-phase training protocol first teaches differential attention with full context, then adapts the model to bounded memory. Benchmarks across three model scales (Eve-2, 7B, 70B parameters) on NVIDIA H200 demonstrate up to 37× per-layer memory compression, scale-invariant bounded prefill throughput of ∼150K tokens/second regardless of model dimension, and measured compression ratios exceeding 1,100× at 70B scale with 128K context. The bounded state is a fixed-size serializable artifact, enabling a new paradigm of Stateful Neural Databases for agentic retrieval-augmented generation.
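To make the O(W+Me+Ms) bound concrete, the following is a minimal arithmetic sketch, not the paper's implementation. It assumes W is the size of a recent-token window, Me the number of landmark-bank slots, and Ms the number of summary-bank slots, and that a conventional cache keeps one key/value slot per token. The function names and the sizes (256, 64, 64) are hypothetical and chosen only for illustration; they are not taken from the paper and are not meant to reproduce its reported ratios.

```python
def bounded_slots(window: int, landmark_slots: int, summary_slots: int) -> int:
    """Per-layer KV slots held by a bounded cache: W + Me + Ms (assumed layout)."""
    return window + landmark_slots + summary_slots


def compression_ratio(seq_len: int, window: int,
                      landmark_slots: int, summary_slots: int) -> float:
    """Slots in a full per-token cache divided by slots in the bounded cache."""
    return seq_len / bounded_slots(window, landmark_slots, summary_slots)


if __name__ == "__main__":
    # Hypothetical sizes, for illustration only.
    W, M_E, M_S = 256, 64, 64
    for n in (4_096, 32_768, 131_072):
        ratio = compression_ratio(n, W, M_E, M_S)
        print(f"seq_len={n:>7}  bounded={bounded_slots(W, M_E, M_S)}  ratio={ratio:.1f}x")
```

Under these assumptions the bounded slot count stays constant while the ratio grows linearly with sequence length, which is why a compression figure only has meaning when quoted together with the context length.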
Anthony Maio
Mining Institute
Anthony Maio (Mon,) studied this question.
www.synapsesocial.com/papers/6996a7c3ecb39a600b3edccc — DOI: https://doi.org/10.5281/zenodo.18663264