What question did this study set out to answer?

To propose and analyze a mirror descent safe policy optimization algorithm for reinforcement learning agents that enhances safety while maximizing returns.

March 21, 2026

Mirror Descent Safe Policy Optimization for Reinforcement Learning Agents

Key Points

Develop the MDSPO algorithm using mirror descent optimization.
Formulate an innovative optimization objective.
Implement a three-stage optimization strategy without hard constraints.
In two sets of constrained locomotion experiments, MDSPO improves average return by about 12%.
MDSPO satisfies the cost constraints better than other state-of-the-art methods.
In a real-world obstacle avoidance experiment with an unmanned surface vessel, MDSPO finds the optimal path while keeping the agent safe.

Abstract

Embodied intelligence and related disciplines have identified several mechanisms that help embodied agents learn to solve complex problems. Reinforcement learning (RL) is one of the most promising computational approaches for enhancing the learning-based problem-solving abilities of such agents. With the recent rapid evolution of artificial intelligence, RL has become a keystone technology, accelerating scientific discovery and finding applications in many other domains. In RL, an agent collects data by interacting with the environment and uses that data to optimize a policy for a higher return. Further improvement requires more exploration of the action space. However, not all actions in that space are safe or acceptable, so the agent's exploration must be constrained. In this work, a novel mirror descent safe policy optimization (MDSPO) algorithm is proposed to ensure the safety of an RL agent. The algorithm leverages mirror descent optimization to maximize the return while satisfying the safety constraint. A novel optimization objective is formulated, and an innovative three-stage optimization strategy is employed: gradient descent without the cost constraint, projection onto the nonparametric policy space with the cost constraint, and projection onto the parametric policy space. Compared to previous methods, MDSPO is a simple, easy-to-implement first-order approach that does not impose a hard constraint on the trust region. Theoretical analysis of MDSPO yields a lower bound on return improvement and an upper bound on constraint violation at each policy update. Numerical results from two sets of constrained locomotion experiments show that MDSPO improves the average return by about 12% and satisfies the cost constraints better than other state-of-the-art methods. In a real-world obstacle avoidance experiment using an unmanned surface vessel, MDSPO both finds the optimal path and guarantees agent safety.
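
The three-stage strategy described in the abstract can be illustrated with a toy sketch. This is not the paper's algorithm or objective: it assumes a single-state (bandit) problem with a handful of discrete actions, a KL-divergence mirror map, and a tabular softmax as the "parametric" policy. All function names, parameters, and numbers below are hypothetical and chosen only for illustration.

```python
import math

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def expected_value(policy, values):
    return sum(p * v for p, v in zip(policy, values))

def mirror_step(policy, reward_adv, step_size):
    """Stage 1 (assumed form): mirror descent step on the reward alone.
    With a KL mirror map the update is proportional to
    policy * exp(step_size * advantage)."""
    return normalize([p * math.exp(step_size * a)
                      for p, a in zip(policy, reward_adv)])

def project_cost(policy, costs, limit, iters=60):
    """Stage 2 (assumed form): KL projection onto {q : E_q[cost] <= limit}.
    The solution is proportional to policy * exp(-lam * cost) with lam >= 0;
    lam is found by bisection."""
    if expected_value(policy, costs) <= limit:
        return list(policy)
    c_min = min(costs)
    if limit < c_min:
        raise ValueError("cost limit is infeasible")

    def tilt(lam):
        # Shift by c_min so the largest exponent is 0 (numerical stability).
        return normalize([p * math.exp(-lam * (c - c_min))
                          for p, c in zip(policy, costs)])

    lo, hi = 0.0, 1.0
    while expected_value(tilt(hi), costs) > limit:
        hi *= 2.0
        if hi > 1e9:
            raise ValueError("could not satisfy the cost limit")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if expected_value(tilt(mid), costs) > limit:
            lo = mid
        else:
            hi = mid
    return tilt(hi)  # hi always satisfies the constraint

def softmax(logits):
    m = max(logits)
    return normalize([math.exp(x - m) for x in logits])

def fit_softmax(target, logits, lr=0.5, steps=500):
    """Stage 3 (assumed form): fit softmax logits to the target policy by
    gradient descent on KL(target || softmax(logits)), whose gradient with
    respect to the logits is softmax(logits) - target."""
    for _ in range(steps):
        probs = softmax(logits)
        logits = [x - lr * (q - t) for x, q, t in zip(logits, probs, target)]
    return logits

def mdspo_like_update(policy, reward_adv, costs, limit, step_size):
    half = mirror_step(policy, reward_adv, step_size)   # stage 1
    safe = project_cost(half, costs, limit)             # stage 2
    logits = fit_softmax(safe, [math.log(p) for p in policy])  # stage 3
    return softmax(logits)

if __name__ == "__main__":
    policy = [0.2, 0.3, 0.5]        # current policy over three actions
    reward_adv = [1.0, 0.5, 0.0]    # hypothetical reward advantages
    costs = [1.0, 0.2, 0.0]         # hypothetical per-action costs
    new_policy = mdspo_like_update(policy, reward_adv, costs,
                                   limit=0.3, step_size=1.0)
    print("new policy:", new_policy)
    print("expected cost:", expected_value(new_policy, costs))
    print("expected reward:", expected_value(new_policy, reward_adv))
```

In this toy setting the first stage pushes probability toward high-reward actions and overshoots the cost limit, the second stage pulls the policy back to the nearest (in KL) policy that meets the limit, and the third stage fits the parametric policy to that target. A full RL implementation would replace the fixed advantage and cost vectors with estimates from sampled trajectories.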
