What question did this study set out to answer?

This research aims to develop an efficient memory architecture for long-context reasoning in language models.

May 29, 2026 · Open Access

A Conditional Small-State Memory Architecture for Efficient Long-Context Reasoning

Key Points

Designed Anamnesis, which combines Retentive Networks (RetNet), Block Attention Residuals (AttnRes), and Hashed N-gram Engram memory.
Implemented a complete proof-aligned architecture that does not rely on a full Transformer Key-Value (KV) cache.
Improved memory-bound recall using isotropic scalar gating and depth-axis attention.
Showed significantly improved recall of decision-critical states at reduced memory cost.
Decoupled the time, depth, and pattern memory axes, leading to better information processing.
Demonstrated a robust performance increase compared to traditional architectures.

Abstract

This deposit contains the official PyTorch research scaffold, unit test suites, and formal mathematical proofs for Anamnesis, a resource-rational, budgeted long-context memory architecture combining Retentive Networks (RetNet), Block Attention Residuals (AttnRes), and Hashed N-gram Engram memory. Long-context language models must balance the low marginal cost of sequence streaming with high-fidelity exact recall of decision-critical states. This repository presents a complete proof-aligned implementation designed to explore budgeted memory boundaries without relying on a full Transformer Key-Value (KV) cache.

Core Architectural Components

1. Time Axis (RetNet Streaming + Bounded Snapshots): Default sequence streaming handled by contractive recurrent states, augmented with a capped snapshot cache (K_) and an output-side vocabulary-logit projection readout to prevent recurrent decay and superposition noise.
2. Depth Axis (Zero-Parameter Block Attention Residuals): Replaces traditional heavily-parameterized cross-layer projections with pure softmax attention over preceding raw block states using a single learned pseudo-query parameter wₗ ∈ ℝᵈ and an age-based distance penalty.
3. Local Pattern Axis (Hashed N-gram Engram): High-entropy, deterministic local pattern memory built with multi-head N-gram hash tables, isotropic scalar gating, and a dilated causal 1D convolution layer to expand the local receptive field.
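To make the depth-axis component concrete, the following is a minimal sketch of softmax attention over preceding raw block states with a single pseudo-query vector and an age-based penalty. It is not the deposit's PyTorch code: it uses plain Python so that it runs without dependencies, the function name `block_attention_residual` and the parameter `age_penalty` are hypothetical, and the linear form of the penalty (a fixed cost per block of distance) is an assumption, since the abstract does not specify its shape.

```python
import math


def block_attention_residual(block_states, w, age_penalty=0.1):
    """Mix preceding block states using softmax weights from one pseudo-query.

    block_states: list of d-dimensional vectors, oldest block first.
    w:            d-dimensional pseudo-query (the only learned quantity here).
    age_penalty:  hypothetical linear cost per block of distance (assumption).
    """
    n = len(block_states)

    # Score each earlier block: dot product with w, minus a penalty that
    # grows with the block's age (age 0 = most recent block).
    scores = []
    for i, h in enumerate(block_states):
        age = n - 1 - i
        dot = sum(wi * hi for wi, hi in zip(w, h))
        scores.append(dot - age_penalty * age)

    # Numerically stable softmax over the depth axis.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the raw states; no key or value projections are applied.
    d = len(w)
    mixed = [sum(a * h[j] for a, h in zip(weights, block_states)) for j in range(d)]
    return mixed, weights


# Three earlier blocks with d = 2, oldest first.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed, weights = block_attention_residual(states, w=[0.5, 0.5], age_penalty=0.1)
```

In this toy call the most recent block receives the largest weight, because it has both the highest dot product with `w` and no age penalty. Setting `age_penalty=0.0` removes the distance term and leaves pure content-based mixing.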
