What question did this study set out to answer?

The aim is to establish a framework for monitoring and improving stability in self-improving AI agents.

April 21, 2026 · Open Access

WhyLab: A Causal Safety Monitoring Framework for Stable Self-Improving Agents

Key Points

Developed a conditional causal audit framework that activates only in unstable regimes.
Mapped a phase diagram to identify oscillation boundaries for self-improving agents.
Evaluated the framework on 384 synthetic and 32 large language model (LLM) conditions.
Reduced oscillation by 76% in controlled unstable regimes.
Reduced regressions by 44% on adversarial LLM tasks (Gemini 2.0 Flash) with a fixed sensitivity filter.
Confirmed that the audit remains inactive in the stable regime, as predicted.
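As a rough illustration of how a step-size sweep can expose an instability boundary, the sketch below uses a toy linear update, x <- x - h*k*x. This is not the paper's agent model: the update rule, the gain `k = 10.0` (chosen only so that the toy boundary 2/k coincides with the reported h of about 0.2), and the function names are all assumptions made for illustration.

```python
def simulate(h, k=10.0, x0=1.0, steps=50):
    """Iterate the toy update x <- (1 - h*k) * x.

    Returns (sign_flips, final_abs_x). k is a hypothetical feedback gain.
    """
    x = x0
    flips = 0
    for _ in range(steps):
        x_next = (1.0 - h * k) * x
        if x_next * x < 0:  # the iterate changed sign: one oscillation
            flips += 1
        x = x_next
    return flips, abs(x)


def classify(h, k=10.0):
    """Label a step size by the behaviour of the toy update."""
    flips, final = simulate(h, k)
    if final > 1.0:  # amplitude grew beyond the starting value
        return "unstable"
    return "damped oscillation" if flips else "stable"


if __name__ == "__main__":
    for h in (0.05, 0.15, 0.25):
        print(f"h={h:.2f}: {classify(h)}")
```

In this toy model the iterate converges without sign changes for h < 1/k, oscillates with shrinking amplitude for 1/k < h < 2/k, and diverges for h > 2/k. A phase diagram in the paper's sense would sweep more than one parameter; this sketch sweeps only the step size.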

Abstract

Status: NeurIPS 2026 submission under double-blind review. Author identity anonymized.

Self-improving AI agents lack runtime safeguards that prevent evaluation drift, fragile outcome acceptance, and unbounded parameter updates from compounding into catastrophic policy degradation. We study cognitive policy oscillation (strategy degradation caused by hallucinated feedback) and map an oscillation phase diagram for self-improving agents (384 synthetic + 32 LLM conditions). A sharp instability boundary emerges at moderate step sizes (h ≈ 0.2), yielding a phase-aware deployment rule.

WhyLab is a conditional causal audit framework that activates only in the unstable regime. It has three components:

C1: Information-theoretic drift index
C2: Sensitivity filter combining E-values and partial R² bounds
C3: Lyapunov-bounded damping controller

Our contribution is boundary delineation: identifying when intervention is warranted, not universal improvement. In controlled unstable regimes, the audit reduces oscillation by 76%. On adversarial LLM tasks, fixed C2 reduces regressions by 44% on Gemini 2.0 Flash (p = 0.014, Bonferroni-adjusted p = 0.042). In the stable regime (SWE-bench Lite, 10,500 episodes), the audit remains inactive, as predicted. Docker evaluations on Gemini 2.0/2.5 Flash show zero observed C2-caused regressions.

Change log (v2 vs v1): abstract condensed to boundary-delineation framing (honest null-result acknowledgement); C2 targeted SWE-bench selective follow-up transparently reported (no net gain vs fixed C2); full Docker evaluation on Gemini 2.5 Flash added; phase-aware deployment rule formalized; references and deployment checklist expanded.
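Two quantities named in the abstract can be made concrete with a short sketch. The E-value below uses the standard formula for a risk ratio (VanderWeele and Ding); how WhyLab's C2 filter actually combines E-values with partial R² bounds is not described here, so that part is left out. The Bonferroni check assumes three comparisons, which is an inference from the reported numbers (0.014 × 3 = 0.042) and not something the abstract states.

```python
import math


def e_value(rr):
    """E-value for a risk ratio: RR + sqrt(RR * (RR - 1)).

    Ratios below 1 are inverted first, as is standard for protective effects.
    """
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    if rr < 1:
        rr = 1.0 / rr
    return rr + math.sqrt(rr * (rr - 1.0))


def bonferroni(p, m):
    """Bonferroni-adjusted p-value for m comparisons, capped at 1."""
    return min(1.0, p * m)


if __name__ == "__main__":
    # A risk ratio of 2 is an arbitrary example value, not a result from the paper.
    print(f"E-value for RR=2: {e_value(2.0):.3f}")
    # m=3 is an assumption consistent with the reported adjusted p-value.
    print(f"Adjusted p: {bonferroni(0.014, 3):.3f}")
```

A larger E-value means an unmeasured confounder would need a stronger association with both treatment and outcome to explain the observed effect away, which is why it is a natural input to a sensitivity filter.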

Authors

Anonymous Author

Institutions

American Foundation for the Blind
