August 1, 2025 · Open Access

Coupled Penalties-Augmented Proximal Policy Optimization for Safe Reinforcement Learning

Key Points

CPSPO (coupled penalty-based safe policy optimization) improves both reward optimization and safety constraint satisfaction in safe reinforcement learning.
It introduces coupled penalties that account for cost and Kullback-Leibler (KL) divergence constraint violations at the same time, rather than treating the two constraints separately.
It works within a first-order local policy search framework and places the KL-divergence constraint directly in the loss function instead of clipping the policy ratio, which keeps the implementation simple.
Extensive experiments show CPSPO outperforming mainstream safe reinforcement learning algorithms.

Abstract

Conventional penalty function-based safe reinforcement learning (RL) algorithms often handle the safety constraint and the policy difference constraint separately, and they lack a dedicated mechanism for the case where both constraints are violated at once. This limitation can cause severe performance degradation. To address it, this work proposes a simple yet efficient safe RL algorithm called coupled penalty-based safe policy optimization (CPSPO). CPSPO operates within the first-order local policy search framework, which makes it easy to implement in practice. To strengthen behavior correction when both constraints are violated, CPSPO introduces coupled penalties that consider the cost violation and the Kullback-Leibler (KL) divergence violation simultaneously, improving the handling of both constraints. Moreover, instead of the conventional policy ratio clipping mechanism, CPSPO incorporates the KL-divergence constraint directly into the loss function, providing a more intuitive and practical approach to proximal safe policy optimization. Extensive experiments demonstrate that CPSPO outperforms mainstream safe RL algorithms in both reward optimization and safety constraint satisfaction.
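
The abstract does not state the CPSPO loss itself, so the following Python sketch only illustrates the general idea of a coupled penalty under stated assumptions. It assumes hinge-style penalties on the cost and KL violations plus a product term that is nonzero only when both constraints are violated. The function names, the product form of the coupling, and the coefficients `rho_cost`, `rho_kl`, and `rho_both` are hypothetical and are not taken from the paper.

```python
import numpy as np


def categorical_kl(p_old, p_new, eps=1e-8):
    """Mean KL(p_old || p_new) over a batch of categorical action distributions."""
    p_old = np.clip(p_old, eps, 1.0)
    p_new = np.clip(p_new, eps, 1.0)
    per_state = np.sum(p_old * (np.log(p_old) - np.log(p_new)), axis=-1)
    return float(np.mean(per_state))


def coupled_penalty_loss(ratio, reward_adv, cost_adv, kl, episode_cost,
                         cost_limit, kl_limit,
                         rho_cost=10.0, rho_kl=10.0, rho_both=10.0):
    """Illustrative penalized surrogate loss (assumed form, not the paper's).

    ratio        : pi_new(a|s) / pi_old(a|s) for each sample
    reward_adv   : reward advantage estimates
    cost_adv     : cost advantage estimates
    kl           : mean KL divergence between old and new policy
    episode_cost : average episode cost of the old policy
    """
    reward_surrogate = np.mean(ratio * reward_adv)
    # First-order estimate of the new policy's cost.
    estimated_cost = episode_cost + np.mean(ratio * cost_adv)

    # Hinge terms: zero while a constraint is satisfied.
    cost_violation = max(0.0, estimated_cost - cost_limit)
    kl_violation = max(0.0, kl - kl_limit)

    # Assumed coupling: active only when BOTH constraints are violated.
    coupled = cost_violation * kl_violation

    loss = (-reward_surrogate
            + rho_cost * cost_violation
            + rho_kl * kl_violation
            + rho_both * coupled)
    return float(loss)


if __name__ == "__main__":
    n = 4
    ones, zeros = np.ones(n), np.zeros(n)
    both = coupled_penalty_loss(ones, zeros, zeros, kl=0.03, episode_cost=30.0,
                                cost_limit=25.0, kl_limit=0.01)
    print("loss with both constraints violated:", both)
```

In this sketch the KL term appears in the loss directly, so no ratio clipping is needed, and the product term adds extra pressure only in the regime the abstract highlights, where the cost limit and the KL limit are exceeded together.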

Cite This Study

Pang et al. (2025) studied this question.

synapsesocial.com/papers/68af5210ad7bf08b1ead974c https://doi.org/10.1088/1742-6596/3077/1/012002
