Abstract: We introduce L-Dynamic Attention, a learned Key-Value (KV) cache management mechanism for efficient long-context Transformer inference. In modern Large Language Models (LLMs), the KV cache grows linearly with sequence length, so processing extended sequences runs into severe memory bottlenecks. Instead of relying on rigid, hand-crafted pruning heuristics, our framework assigns each token a dynamic scalar utility score $w_j \in [0, 1]$, estimated by a lightweight predictor trained on key embeddings, positional tokens, and local context. Each utility score is combined with the token's age (temporal entropy $t_j^{\mathrm{age}}$) into a unified viability function $v_j = w_j^2 / (t_j^{\mathrm{age}} + \varepsilon)$, which directly instantiates the foundational Lt-parameter framework ($v = L^2/t$) within deep learning architectures. Under an adaptive percentile-thresholding eviction policy, non-essential tokens are systematically evicted to maintain a strict memory budget. Empirical evaluations on LLaMA-2 7B (context lengths up to 32k) demonstrate up to a 10× reduction in memory footprint with negligible accuracy degradation (<1.5% perplexity increase on PG19 and <3% drop in long-context retrieval accuracy). We provide a rigorous theoretical interpretation of this mechanism as an approximate solution to a constrained memory optimization problem under an evolutionary survival-process paradigm.
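The viability score and budgeted eviction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the utility scores are taken as given (the learned predictor is not modeled), token age is assumed to be measured in decoding steps, and the adaptive percentile threshold is approximated by keeping the `budget` highest-viability tokens, which is equivalent to thresholding at the corresponding percentile. The function names `viability` and `evict_to_budget` are hypothetical.

```python
def viability(utilities, ages, eps=1e-6):
    """Compute v_j = w_j^2 / (t_j + eps) for each cached token.

    utilities: scores w_j in [0, 1] (assumed to come from a predictor).
    ages: token ages t_j >= 0 (assumed unit: decoding steps).
    """
    return [w * w / (t + eps) for w, t in zip(utilities, ages)]


def evict_to_budget(utilities, ages, budget, eps=1e-6):
    """Return the sorted cache indices that survive eviction.

    Keeping the top-`budget` tokens by viability stands in for the
    percentile threshold: everything below the cut-off is evicted.
    """
    scores = viability(utilities, ages, eps)
    if budget >= len(scores):
        return list(range(len(scores)))  # cache already fits the budget
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:budget])


# Five cached tokens, oldest first; only two slots are available.
kept = evict_to_budget(
    utilities=[0.9, 0.2, 0.8, 0.1, 0.95],
    ages=[5, 4, 3, 2, 1],
    budget=2,
)
print(kept)  # [2, 4]
```

In this toy example the oldest token has a high utility (0.9), but its age lowers its viability below that of the more recent tokens at indices 2 and 4, so it is evicted. This shows how the age term in the denominator trades off against the squared utility.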
Stanislav Usychenko studied this question.