Abstract
Conventional penalty-function-based safe reinforcement learning (RL) algorithms often handle the safety constraint and the policy-difference constraint separately, and they lack a dedicated mechanism for the case in which both constraints are violated simultaneously. This limitation can cause severe performance degradation and warrants further investigation. To address it, this work proposes a simple yet efficient safe RL algorithm called coupled penalty-based safe policy optimization (CPSPO). CPSPO operates within the first-order local policy search framework, which makes it easy to implement in practice. To strengthen behavior correction when both constraints are violated, CPSPO introduces coupled penalties that account for cost and Kullback-Leibler (KL) divergence constraint violations simultaneously, improving the handling of both constraints. Moreover, instead of the conventional policy-ratio clipping mechanism, CPSPO incorporates the KL-divergence constraint directly into the loss function, providing a more intuitive and practical approach to proximal safe policy optimization. Extensive experiments demonstrate that CPSPO outperforms mainstream safe RL algorithms in both reward optimization and safety constraint satisfaction.
Pang et al. (Fri,) studied this question.
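
The abstract does not give the CPSPO loss itself, so the following is only a minimal sketch of how a first-order loss with a KL term and a coupled penalty might be assembled. The function names (`categorical_kl`, `coupled_penalty_loss`), the hinge-style violations, the product form of the coupling term, and the coefficients `mu_cost`, `mu_kl`, and `mu_couple` are all illustrative assumptions, not the authors' formulation.

```python
import numpy as np


def categorical_kl(p_old, p_new, eps=1e-12):
    """Mean KL(p_old || p_new) over a batch of categorical distributions."""
    p_old = np.asarray(p_old, dtype=float)
    p_new = np.asarray(p_new, dtype=float)
    per_state = np.sum(p_old * (np.log(p_old + eps) - np.log(p_new + eps)), axis=-1)
    return float(np.mean(per_state))


def coupled_penalty_loss(ratio, adv_reward, adv_cost, kl, cost_estimate,
                         cost_limit, kl_limit,
                         mu_cost=1.0, mu_kl=1.0, mu_couple=1.0):
    """Hypothetical loss: reward surrogate plus separate and coupled penalties.

    ASSUMPTION: the coupling is modelled as the product of the two hinge
    violations, so it is nonzero only when both constraints are violated.
    """
    ratio = np.asarray(ratio, dtype=float)
    adv_reward = np.asarray(adv_reward, dtype=float)
    adv_cost = np.asarray(adv_cost, dtype=float)

    # First-order surrogates built from the policy ratio (no ratio clipping).
    reward_surrogate = np.mean(ratio * adv_reward)
    cost_surrogate = cost_estimate + np.mean(ratio * adv_cost)

    # Hinge violations: zero while the corresponding constraint is satisfied.
    cost_violation = max(0.0, cost_surrogate - cost_limit)
    kl_violation = max(0.0, kl - kl_limit)

    penalty = (mu_cost * cost_violation
               + mu_kl * kl_violation
               + mu_couple * cost_violation * kl_violation)
    return float(-reward_surrogate + penalty)
```

In this sketch the KL term enters the loss directly rather than through a clipped ratio, and the product term adds extra pressure only when the cost and KL limits are exceeded together; other coupling forms would fit the abstract's description equally well.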