What question did this study set out to answer?

The aim is to develop a framework to evaluate the stability of reinforcement learning policies under stress conditions.

April 6, 2026 · Open Access

ARCUS-H: Behavioral Stability Under Controlled Stress as a Complementary RL Evaluation Axis

Key Points

Developed ARCUS-H framework for post-hoc evaluation of RL policies.
Applied structured perturbations (sensor noise, actuator disruption, reward corruption) to trained agents, without retraining.
Evaluated agents across five channels of stability: competence, policy consistency, temporal stability, observation reliability, and action entropy divergence.
Generated 979,200 (roughly 1 million) evaluation episodes across 12 environments and 8 algorithms.
Reward explained only 5.7% of the variance in behavioral stability (r = 0.24), so return alone is a weak predictor of stability.
SAC agents were significantly more fragile than TD3 agents under observation noise.
MuJoCo agents displayed the highest instability while maintaining strong nominal performance.
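
As a quick check on the variance figure: assuming the 5.7% is the squared Pearson correlation between reward and the stability score (the page does not say how it was computed), the share of variance explained follows from r alone. The helper name below is hypothetical. With r = 0.24, as reported in the abstract, the result is about 0.058; the small gap to 5.7% is consistent with r having been rounded.

```python
def variance_explained(r: float) -> float:
    """Share of variance explained by a simple linear fit with Pearson correlation r.

    Assumption: the reported percentage is r squared; the source does not confirm this.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie in [-1, 1]")
    return r * r


# r = 0.24 is the correlation reported in the abstract.
share = variance_explained(0.24)
print(f"{share:.4f}")  # prints 0.0576
```

Read the other way, roughly 94% of the variation in stability is left unexplained by reward under this assumption.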

Abstract

ARCUS-H: Behavioral Stability Evaluation Under Stress for RL

ARCUS-H (Adaptive Reinforcement Coherence Under Stress Harness) is a post-hoc evaluation framework for reinforcement learning policies that measures behavioral stability under controlled stress conditions. Standard RL evaluation relies on expected return, which can hide fragility. ARCUS-H applies structured perturbations (sensor noise, actuator disruption, reward corruption) to trained Stable-Baselines3 agents, without retraining or model access, and evaluates stability across five interpretable channels:

Competence
Policy Consistency
Temporal Stability
Observation Reliability
Action Entropy Divergence

These are combined into a composite stability score with an adaptive per-run threshold (mean FPR ≈ 6.1% for target α = 0.05, no environment-specific tuning).

Scale

This release includes:

51 (environment, algorithm) pairs
12 environments, 8 algorithms
8 stress schedules
10 seeds per configuration
Total: ~1M evaluation episodes (979,200)

Key Highlights

Reward explains only 5.7% of behavioral stability variance (r = 0.24)
SAC shows significantly higher fragility than TD3 under observation noise
MuJoCo agents exhibit highest instability despite strong nominal performance
CNN robustness varies by learned representation, not architecture
ARCUS and CVaR capture complementary robustness dimensions

Links

GitHub: https://github.com/karimzn00/ARCUSH
Lab: https://nuraql.com
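
The pipeline described in the abstract (perturb what a trained policy observes, score five channels, combine them, compare against a per-run threshold) can be sketched roughly as follows. This is not the ARCUS-H implementation: the function names, the Gaussian noise model, the unweighted mean, and the quantile-based threshold are all assumptions made for illustration, and only the channel names come from the abstract.

```python
import random
import statistics

# Channel names follow the abstract; everything else here is an illustrative assumption.
CHANNELS = (
    "competence",
    "policy_consistency",
    "temporal_stability",
    "observation_reliability",
    "action_entropy_divergence",
)


def add_observation_noise(obs, sigma, rng):
    """Return a noisy copy of an observation vector (assumed Gaussian sensor noise).

    The policy itself is untouched; only what it sees is perturbed.
    """
    return [x + rng.gauss(0.0, sigma) for x in obs]


def composite_score(channel_scores):
    """Combine the five channel scores, assumed to lie in [0, 1], by an unweighted mean."""
    return statistics.fmean([channel_scores[name] for name in CHANNELS])


def adaptive_threshold(baseline_scores, alpha=0.05):
    """Pick a per-run threshold from unstressed (baseline) composite scores.

    Scores strictly below the returned value are flagged as unstable, so at
    most a fraction alpha of the baseline scores themselves are flagged.
    """
    if not baseline_scores:
        raise ValueError("need at least one baseline score")
    ordered = sorted(baseline_scores)
    return ordered[int(alpha * len(ordered))]


def is_unstable(score, threshold):
    """Flag a composite score that falls below the run's threshold."""
    return score < threshold
```

In this sketch the threshold is derived from the same run's unstressed episodes, so no environment-specific constant is needed. The empirical false positive rate the abstract reports (about 6.1% against a 5% target) belongs to the authors' own thresholding rule, which this example does not reproduce.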
