What question did this study set out to answer?

To develop an attention mechanism that bounds per-layer KV cache memory independent of sequence length while retaining selective long-range context through dual memory banks.

February 19, 2026 · Open Access

CoDA-GQA-L: Bounded-Memory Differential Attention with Value-Routed Landmark Banks

Key Points

Introduced Constrained Orthogonal Differential Attention (CoDA), which sharpens attention by subtracting a gated inhibitory stream.
Created a dual-bank memory system combining an exact landmark bank and an EMA summary bank for context retention.
Employed a two-phase training protocol: differential attention is first learned with full context, then the model is adapted to bounded memory.
Achieved up to 37× per-layer memory compression across three model scales (Eve-2, 7B, 70B parameters).
Demonstrated scale-invariant bounded prefill throughput of approximately 150K tokens/second on NVIDIA H200.
Measured compression ratios exceeding 1,100× at 70B scale with 128K context.
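
To make the bounded-memory claim concrete, the sketch below compares the number of cached key/value slots per layer for an unbounded cache (one slot per token) against a bounded state of W + M_e + M_s slots. The function names and the sizes chosen for W, M_e, and M_s are hypothetical, picked only for illustration. The compression figures reported above may be measured differently (for example in bytes across the whole model), so the ratios computed here are not expected to reproduce them.

```python
def unbounded_slots(seq_len: int) -> int:
    """Standard KV cache: one key/value slot per token, per layer."""
    return seq_len


def bounded_slots(window: int, landmark: int, summary: int) -> int:
    """Bounded state: sliding window + landmark bank + summary bank."""
    return window + landmark + summary


def slot_compression(seq_len: int, window: int, landmark: int, summary: int) -> float:
    """Ratio of unbounded to bounded slot counts for one layer."""
    return unbounded_slots(seq_len) / bounded_slots(window, landmark, summary)


# Hypothetical sizes (assumptions, not values taken from the paper).
W, M_E, M_S = 256, 64, 64

for n in (4_096, 32_768, 131_072):
    ratio = slot_compression(n, W, M_E, M_S)
    print(f"seq_len={n:>7}  bounded={bounded_slots(W, M_E, M_S)}  ratio={ratio:.1f}x")
```

Because the bounded slot count does not depend on the sequence length, the ratio grows linearly with context, which is consistent with the largest reported ratio appearing at the longest context (128K).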

Abstract

We present CoDA-GQA-L, an attention mechanism that provably bounds per-layer KV cache memory to O(W + M_e + M_s), independent of sequence length, while retaining selective long-range context through dual memory banks. The architecture combines three innovations: (1) Constrained Orthogonal Differential Attention (CoDA), which sharpens attention by subtracting a gated inhibitory stream produced via learnable orthogonal rotation of the signal query, eliminating the second query projection required by prior differential attention; (2) a dual-bank bounded memory comprising an exact landmark bank for high-fidelity token retention and an EMA summary bank for thematic compression; and (3) value-routed semantic matching that ensures position-invariant memory updates despite RoPE-at-write key storage. A two-phase training protocol first teaches differential attention with full context, then adapts the model to bounded memory. Benchmarks across three model scales (Eve-2, 7B, 70B parameters) on NVIDIA H200 demonstrate up to 37× per-layer memory compression, scale-invariant bounded prefill throughput of ~150K tokens/second regardless of model dimension, and measured compression ratios exceeding 1,100× at 70B scale with 128K context. The bounded state is a fixed-size serializable artifact, enabling a new paradigm of Stateful Neural Databases for agentic retrieval-augmented generation.
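
The differential-attention idea in item (1) can be illustrated with a minimal single-query sketch in plain Python. It is not the authors' implementation: the abstract does not specify how the rotation is parameterized or how the gate is computed, so the per-pair rotation angles (`angles`), the scalar `gate`, the shared keys for both streams, and all function names below are assumptions. The only properties taken from the abstract are that the inhibitory query is an orthogonal rotation of the signal query (so no second query projection is needed) and that a gated inhibitory stream is subtracted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rotate_pairs(q, angles):
    """Rotate each consecutive pair of dimensions by its own angle.

    A block-diagonal matrix of 2D rotations is orthogonal, so the
    rotated query keeps the norm of the original query.
    """
    out = []
    for i, theta in enumerate(angles):
        x, y = q[2 * i], q[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out += [c * x - s * y, s * x + c * y]
    return out


def attend(q, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]


def coda_attention(q, keys, values, angles, gate):
    """Signal stream minus a gated inhibitory stream (assumed form)."""
    signal = attend(q, keys, values)
    inhibit = attend(rotate_pairs(q, angles), keys, values)
    return [s - gate * n for s, n in zip(signal, inhibit)]


# Toy inputs, chosen arbitrarily for illustration.
q = [1.0, 0.0, 0.5, -0.5]
keys = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.5, 0.5, 0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(coda_attention(q, keys, values, angles=[math.pi / 2, math.pi / 4], gate=0.5))
```

Since both streams share the same values in this sketch, subtracting the two outputs is equivalent to subtracting the two attention maps before multiplying by the values. With `gate=0` the function reduces to ordinary attention, and with zero rotation angles and `gate=1` the two streams cancel exactly.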
