What question did this study set out to answer?

This research aims to validate a theoretical framework based on Lyapunov stability for monitoring task failure in multi-agent AI systems.

June 18, 2026 · Open Access

Empirical Lyapunov Stability: Growth-Ratio Energy Functions as Leading Indicators of Agent Task Failure

Key Points

Conducted a 5-condition ablation study with 3,175 total runs to isolate contributions of each mechanism.
Implemented a hybrid Rust/Python runtime safety library called state-harness and tested it across four benchmarks.
Performed multi-trial validation with 333 SWE-bench runs for statistical robustness.
Achieved zero stability violations across 1,886 runs on short/medium-loop benchmarks with <2% overhead.
Full-stack monitoring on long-loop benchmarks resulted in 38.6% compute reduction and eliminated burnout events.
Confirmed zero false positives across 80 harness runs in local model validation, with naive turn-limiting outperforming unconstrained baselines by 17.5 percentage points on average.

Abstract

In our prior theoretical work, we proposed a physics-inspired framework for governing the semantic boundary layer of multi-agent AI systems, drawing on Lyapunov stability theory, Renormalization Group compression, and Vector Symbolic Architectures. That framework was a theoretical edifice: mathematically grounded but empirically unverified. This paper presents its empirical validation through a 5-condition ablation study (3,175 total runs) isolating each mechanism's contribution, with multi-trial validation (333 SWE-bench runs) confirming statistical robustness and cross-model validation across 5 model families, including 4 open-weight local models. We implement the proposed framework as state-harness, a hybrid Rust/Python runtime safety library, and evaluate it across four complementary benchmarks: τ³-bench (customer-service agents, 750 runs), SWE-bench Verified (software engineering agents, 481 runs), MINT (multi-turn reasoning and coding, 1,136 runs), and a custom local-model battery (808 runs across 4 open-weight model families via Ollama on consumer hardware). Our central empirical finding is that the naive Lyapunov energy function V(k) = S(k) + λθ(k) produces an unacceptable false-positive rate (46%) because multi-turn conversations naturally exhibit ΔV ≥ 0 as context windows accumulate. We resolve this through growth-ratio normalization: monitoring the ratio V̂(k) = S(k)/S̄ of the current value to a warmup baseline S̄, rather than raw token counts. This normalization transforms an unstable diagnostic signal into a precise leading indicator of task failure.
Our 5-condition ablation (Baseline → Lyapunov-only → Lyapunov+RG → Full-stack → Naive Cap) reveals four principal results: (1) on short/medium-loop benchmarks (MINT + τ³), the monitor achieves zero stability violations across 1,886 runs with <2% computational overhead; (2) on long-loop benchmarks (SWE-bench), full-stack monitoring achieves a 38.6% compute reduction and a 30% wall-time reduction while eliminating all max-budget burnout events; (3) multi-trial evaluation (333 SWE-bench runs) confirms that all resolve-rate differences between conditions fall within the ±4–5% LLM nondeterminism band; (4) local model validation across 4 open-weight model families (Llama 3.2:3B, Phi-4-Mini, Qwen3:4B, Gemma4:E4B) confirms zero false positives across 80 harness runs and reveals a novel small-model self-sabotage pattern in which naive turn-limiting outperforms unconstrained baselines by 17.5 percentage points on average. The implementation is released as open source: github.com/vishal-dehurdle/state-harness. Install via PyPI: pip install state-harness.
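The contrast between the naive ΔV ≥ 0 check and growth-ratio normalization can be illustrated with a minimal sketch. This is not the state-harness API: the function names, the 3-turn warmup, the threshold of 3.0, and the reading of S(k) as a per-turn token count with S̄ as the mean over the warmup turns are all assumptions made for illustration. The θ(k) term is not defined in this summary, so the sketch treats V(k) as S(k) alone.

```python
from statistics import mean
from typing import List


def naive_flags(sizes: List[float]) -> List[bool]:
    """Flag every turn where V did not decrease (delta V >= 0).

    Assumption: V(k) is approximated by S(k) only; theta(k) is omitted.
    """
    return [curr - prev >= 0 for prev, curr in zip(sizes, sizes[1:])]


def growth_ratio_flags(
    sizes: List[float],
    warmup_turns: int = 3,    # hypothetical warmup length
    threshold: float = 3.0,   # hypothetical alarm threshold on the ratio
) -> List[bool]:
    """Flag post-warmup turns whose ratio S(k) / S_bar exceeds the threshold.

    Assumption: S_bar is the mean of S(k) over the warmup turns.
    """
    if len(sizes) < warmup_turns:
        return []  # not enough turns to establish a baseline
    baseline = mean(sizes[:warmup_turns])
    if baseline <= 0:
        raise ValueError("warmup baseline must be positive")
    return [s / baseline > threshold for s in sizes[warmup_turns:]]


# A healthy conversation: context grows steadily, turn after turn.
healthy = [100, 120, 140, 160, 180, 200]
# A runaway conversation: context balloons after the warmup turns.
runaway = [100, 120, 140, 400, 900]

print("naive, healthy:       ", naive_flags(healthy))
print("growth-ratio, healthy:", growth_ratio_flags(healthy))
print("growth-ratio, runaway:", growth_ratio_flags(runaway))
```

On the steadily growing trace, the naive check flags every turn, which mirrors the false-positive problem described in the abstract, while the ratio check stays quiet until growth becomes large relative to the warmup baseline.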
