What question did this study set out to answer?

To develop a framework for robust and interpretable deep reinforcement learning that accounts for adversarial conditions and safety constraints.

February 5, 2026 | Open Access

RAISE: Robust and Adversarially Informed Safe Explanations for Reinforcement Learning

Key Points

Introduced the Decomposed Reward NR-MDP (DRNR-MDP) as the setting for robust policy learning.
Developed the DRNR-Deep Deterministic Policy Gradient (DRNR-DDPG) algorithm to learn robust policies and a vector-valued value function.
Used the vector-valued value function to generate contrastive explanations of the form "Why action a instead of b?".
Demonstrated robust performance under dynamics variations on a continuous Cliffworld benchmark and the MuJoCo Hopper task.
Provided meaningful, component-level explanations linking safety and efficiency to policy decisions.
Showed through ablations that ignoring worst-case disturbances can substantially alter or invalidate explanations in robust reinforcement learning.

Abstract

Deep Reinforcement Learning (DRL) policies often exhibit fragility in unseen environments, limiting their deployment in safety-critical applications. While Robust Markov Decision Processes (R-MDPs) enhance control performance by optimizing against worst-case disturbances, the resulting conservative behaviors are difficult to interpret using standard Explainable RL (XRL) methods, which typically ignore adversarial disturbances. To bridge this gap, this paper proposes RAISE (Robust and Adversarially Informed Safe Explanations), a novel framework designed for the Noisy Action Robust MDP (NR-MDP) setting. We first introduce the Decomposed Reward NR-MDP (DRNR-MDP) and the DRNR-Deep Deterministic Policy Gradient (DRNR-DDPG) algorithm to learn robust policies and a vector-valued value function. RAISE utilizes this vectorized value function to generate contrastive explanations (“Why action a instead of b?”), explicitly highlighting the reward components, such as safety or energy efficiency, that are prioritized under worst-case attacks. Experiments on a continuous Cliffworld benchmark and the MuJoCo Hopper task demonstrate that the proposed method preserves robust performance under dynamics variations and produces meaningful, component-level explanations that align with intuitive safety and performance trade-offs. Ablation results further show that ignoring worst-case disturbances can substantially alter or invalidate explanations, underscoring the importance of adversarial awareness for reliable interpretability in robust RL.
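
To make the idea of a component-level contrastive explanation concrete, the following is a minimal sketch and not the paper's implementation. It assumes that the scalar value is the sum of the per-component values, that the executed action is a convex mix of the agent's action and a disturbance with weight `alpha`, and that the worst case is found by searching a small grid of disturbances. The component names, the toy critic `toy_q_vector`, and the helpers `worst_case_q` and `contrast` are all hypothetical stand-ins for a learned vector-valued value function.

```python
import numpy as np

# Hypothetical reward components, chosen only for illustration.
COMPONENTS = ("safety", "energy", "progress")


def toy_q_vector(action, disturbance, alpha=0.2):
    """Toy stand-in for a learned vector-valued critic (one value per component).

    Assumption: the executed action mixes the agent's action with a disturbance.
    """
    executed = (1.0 - alpha) * action + alpha * disturbance
    safety = -10.0 * max(0.0, executed - 0.5)  # penalty past a "cliff" at 0.5
    energy = -executed ** 2                    # quadratic control cost
    progress = 2.0 * executed                  # reward for moving forward
    return np.array([safety, energy, progress])


def worst_case_q(action, disturbances, alpha=0.2):
    """Q-vector under the disturbance that minimises the summed value."""
    candidates = [toy_q_vector(action, d, alpha) for d in disturbances]
    return min(candidates, key=lambda q: float(q.sum()))


def contrast(action_a, action_b, disturbances, alpha=0.2):
    """Per-component answer to 'Why action a instead of b?'.

    Positive entries are components that favour a under the worst case.
    """
    delta = (worst_case_q(action_a, disturbances, alpha)
             - worst_case_q(action_b, disturbances, alpha))
    return {name: float(v) for name, v in zip(COMPONENTS, delta)}


if __name__ == "__main__":
    grid = np.linspace(-1.0, 1.0, 5)
    # Adversarially informed contrast versus a nominal one (no disturbance).
    print("worst-case:", contrast(0.4, 0.8, grid))
    print("nominal:   ", contrast(0.4, 0.8, [0.0]))
```

In this toy setting the cautious action is favoured by the safety and energy components and disfavoured by progress, and the size of each contribution changes when the disturbance is ignored. That is the kind of discrepancy the ablation in the abstract refers to, although the numbers here come from a made-up critic and say nothing about the paper's results.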
