What question did this study set out to answer?

This research addresses the accumulation of numerical error in persistent agentic systems whose context grows without bound.

May 15, 2026 | Open Access

One Card, One Stack: Constraint-Driven Architecture for Asymptotically Stable Inference over Unbounded Agent Memory

Key Points

Developed an online self-learning Markov expert prediction system with DMA offload for efficient inference.
Designed an adaptive KV quantization family accommodating multiple compression modes for optimized performance.
Implemented attentional provenance indexing and a three-tier paged context for dynamic resource utilization.
Achieved 509 t/s single-context, 2.6–3.4× faster than community benchmarks for the same model on an RTX 4090 24GB with standard single-session frameworks.
Achieved 2,446 t/s aggregate across 64 concurrent persistent-memory sessions on an RTX 4090 Mobile (16GB).
Proved that numerical error per generation step is bounded by a constant independent of context depth.

Abstract

Persistent agentic systems require context that grows without bound. Under standard full attention, numerical error per generation step grows monotonically with context depth, for any finite-precision arithmetic, any compression scheme, and any hardware, because every token participates in every subsequent computation with equal structural weight. This is not a compression problem; it is an architectural one. An H100 with 80GB enters the same accumulation regime as a 16GB GPU the moment any token is compressed or evicted; the larger memory simply defers that moment. We prove that the accumulation problem is architectural rather than representational, and that bounded error accumulation at unbounded context depth requires decoupling the working set from context depth. We present the first complete system implementing this requirement. The key theoretical result (Theorem §11.2, Asymptotic Numerical Stability) is that under provenance-selected attention over a tiered context, total numerical error per generation step, from any source including floating-point rounding, is bounded by a constant, O(1), independent of context depth N, in contrast with the O(N) scaling of standard full-attention systems. Under practical system conditions (warm-tier blocks originating from prefill-refreshed hot-tier blocks) this constant is small, approaching the hot-tier error floor. This inverts the universal assumption of the KV quantization literature that error grows with N.
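The contrast stated by the theorem can be written schematically as follows. The symbols $\varepsilon_{\text{full}}$, $\varepsilon_{\text{sel}}$, and $C$ are notation assumed here for illustration and are not taken from the paper; the precise statement and its proof are those of Theorem §11.2.

```latex
\varepsilon_{\text{full}}(N) = O(N)
\qquad \text{versus} \qquad
\varepsilon_{\text{sel}}(N) \le C \quad \text{for all } N,
\quad \text{where } C \text{ does not depend on } N .
```

Here $\varepsilon_{\text{full}}(N)$ stands for the per-step numerical error under standard full attention at context depth $N$, and $\varepsilon_{\text{sel}}(N)$ for the per-step error under provenance-selected attention over the tiered context.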
The system is built on four integrated contributions: (1) an online self-learning Markov expert prediction system with DMA offload and wave-batched grouped GEMM, achieving stall-free MoE inference under partial VRAM residency; (2) an adaptive per-block KV quantization family spanning FP16 to 2-bit integer, with boundary-aware sub-block structure, a two-phase prefill refresh that eliminates autoregressive decode drift, and per-block selection across ten compression modes ranging from 1.21× (top-quality tier) to 4.67× (highest-compression tier) per head (the production-achievable range given the attention kernel's per-head gather constraint), selected using asymmetric K/V error thresholds matched to the softmax-amplified K and linear-bounded V error propagation paths, with the overall system compression ratio dependent on the block-level mode distribution; (3) attentional provenance indexing via Q-vector cognitive-state fingerprints with Speculative Context Decode, a pipelined two-session generation loop that hides CPU provenance scoring (3–10ms flat scan) behind a parallel variable-window probe session terminating at newline boundaries, yielding working-set refinement at the model's natural reasoning granularity with near-zero visible overhead; and (4) an unbounded three-tier paged context (VRAM-hot, CPU RAM-warm, disk-cold) with adaptive quantization calibrated to the asymptotic guarantee. Each contribution originated from a hard constraint that closed off the standard solution and forced an architectural choice that turns out to be universally correct.
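To make the prediction idea in contribution (1) concrete, here is a minimal sketch of an online Markov predictor over expert ids. It is not the paper's implementation: the first-order model, the `ExpertPredictor` type, its `observe` and `predict` methods, and the tie-breaking rule are all assumptions made for illustration, and the sketch covers only the prediction step, not DMA offload or wave-batched grouped GEMM.

```rust
use std::collections::HashMap;

/// Hypothetical online first-order Markov predictor over expert ids.
/// It learns transition counts as routing decisions are observed and
/// ranks likely successors, e.g. to decide which experts to prefetch.
struct ExpertPredictor {
    /// counts[prev][next] = how many times `next` followed `prev`.
    counts: HashMap<u32, HashMap<u32, u64>>,
    /// Most recently observed expert, if any.
    last: Option<u32>,
}

impl ExpertPredictor {
    fn new() -> Self {
        Self { counts: HashMap::new(), last: None }
    }

    /// Record that `expert` was just selected, updating the
    /// transition count from the previously observed expert.
    fn observe(&mut self, expert: u32) {
        if let Some(prev) = self.last {
            *self.counts.entry(prev).or_default().entry(expert).or_insert(0) += 1;
        }
        self.last = Some(expert);
    }

    /// Return up to `k` most frequent successors of the last observed
    /// expert, ordered by count (descending), then by id (ascending)
    /// so that ties resolve deterministically.
    fn predict(&self, k: usize) -> Vec<u32> {
        let Some(prev) = self.last else { return Vec::new() };
        let Some(row) = self.counts.get(&prev) else { return Vec::new() };
        let mut ranked: Vec<(u32, u64)> = row.iter().map(|(&e, &c)| (e, c)).collect();
        ranked.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
        ranked.into_iter().take(k).map(|(e, _)| e).collect()
    }
}
```

In this sketch a caller would invoke `observe` after each routing decision and `predict(k)` to choose which experts to bring into VRAM ahead of use. Because the counts update on every observation, the predictor adapts online without a separate training phase.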
Implemented in Rust on a custom Candle fork with native quantized matmul kernels that never materialise a full-precision weight copy, the system demonstrates 509 t/s single-context, 2.6–3.4× faster than community benchmarks for this model on an RTX 4090 24GB with standard single-session frameworks (hardware-corner.net, 2025; ToolHalla, 2026), and 2,446 t/s aggregate across 64 concurrent persistent-memory sessions on an RTX 4090 Mobile (16GB). The concurrent-session figure reflects server throughput across 64 simultaneous agents; no standard framework runs this model on 16GB at comparable concurrency. An evaluation methodology is described in §9.12 using the system's own 2.2M-line Rust/CUDA Candle fork as the test subject: the system is ingested into unbounded context via a ~20M-token learning-phase conversation, then queried via iterative multi-hop retrieval during decode. The one-shot ablation (same index, single pre-generation retrieval) isolates the contribution of continuous decode-time retrieval. Quantitative results are reserved for v2, which will incorporate community validation and independent optimization. The working system is publicly available for live verification and collaborative development (Appendix C).
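As a rough reading of the throughput figures, the arithmetic below derives an average per-session rate and the single-session baselines implied by the stated speedup. It assumes the aggregate throughput is shared evenly across the 64 sessions, which is an assumption made here and not a statement from the paper. The two headline figures were also measured on different hardware (RTX 4090 24GB and RTX 4090 Mobile 16GB), so they are not directly comparable.

```latex
\frac{2446\ \text{t/s}}{64\ \text{sessions}} \approx 38.2\ \text{t/s per session},
\qquad
\frac{509}{3.4} \approx 149.7\ \text{t/s},
\quad
\frac{509}{2.6} \approx 195.8\ \text{t/s}.
```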
