Most existing adversarial training methods in reinforcement learning (RL) offer limited robustness and remain vulnerable to novel attacks. To address this limitation, this work proposes robust RL via leveraging historically optimal policy with regulation of performance (HORP), an approach that enhances policy robustness by using the historically optimal policy to guide policy optimization and by generating diverse adversarial perturbations. Unlike approaches that rely solely on trial-and-error interactions, HORP constructs a guidance value function that jointly considers value gaps and policy distribution divergence, thereby prioritizing learning in promising action spaces. It also incorporates an adaptive, performance-aware optimization mechanism that triggers timely corrections and prevents the agent from deviating from optimal performance. Furthermore, HORP dynamically modulates perturbation entropy through controlled uncertainty injection, improving the agent's ability to defend against a broader range of attacks. Experiments demonstrate that HORP outperforms the compared methods in most cases, in both natural performance and robustness against various state attacks.
Chen et al. studied this question.