What type of study is this?

This is a quantitative study.

October 20, 2025 | Open Access

Trust Region Reward Optimization and Proximal Inverse Reward Optimization Algorithm

Key Points

Trust Region Reward Optimization (TRRO) guarantees monotonic improvement in the likelihood of expert behavior through a Minorization-Maximization process.
Proximal Inverse Reward Optimization (PIRO) matches or surpasses state-of-the-art baselines in reward recovery and policy imitation, with high sample efficiency.
Recent non-adversarial IRL approaches improve stability through energy-based formulations but lack formal guarantees; TRRO closes this gap.
Empirical evaluations cover MuJoCo and Gym-Robotics benchmarks and a real-world animal behavior modeling task.

Abstract

Inverse Reinforcement Learning (IRL) learns a reward function to explain expert demonstrations. Modern IRL methods often use an adversarial (minimax) formulation that alternates between reward and policy optimization, which often leads to unstable training. Recent non-adversarial IRL approaches improve stability by jointly learning reward and policy via energy-based formulations but lack formal guarantees. This work bridges this gap. We first present a unified view showing that canonical non-adversarial methods explicitly or implicitly maximize the likelihood of expert behavior, which is equivalent to minimizing the expected return gap. This insight leads to our main contribution: Trust Region Reward Optimization (TRRO), a framework that guarantees monotonic improvement in this likelihood via a Minorization-Maximization process. We instantiate TRRO into Proximal Inverse Reward Optimization (PIRO), a practical and stable IRL algorithm. Theoretically, TRRO provides the IRL counterpart to the stability guarantees of Trust Region Policy Optimization (TRPO) in forward RL. Empirically, PIRO matches or surpasses state-of-the-art baselines in reward recovery and policy imitation, with high sample efficiency, on MuJoCo and Gym-Robotics benchmarks and a real-world animal behavior modeling task.
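
To make the idea of monotonic likelihood improvement via Minorization-Maximization concrete, here is a minimal toy sketch. It is not TRRO or PIRO and is not taken from the paper. It assumes a single-state problem in which an energy-based policy picks action a with probability proportional to exp(r[a]), and it fits r by repeatedly maximizing a quadratic minorizer of the expert log-likelihood. All names (softmax, expert_log_likelihood, mm_step, fit_reward) and the expert frequencies used below are hypothetical and chosen only for illustration.

```python
import math


def softmax(r):
    """Energy-based policy: pi(a) is proportional to exp(r[a])."""
    m = max(r)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in r]
    z = sum(exps)
    return [e / z for e in exps]


def expert_log_likelihood(r, expert_freq):
    """Average log-likelihood of expert actions under the policy induced by r."""
    pi = softmax(r)
    return sum(p * math.log(q) for p, q in zip(expert_freq, pi) if p > 0)


def mm_step(r, expert_freq, step=1.0):
    """Maximize a quadratic minorizer of the log-likelihood around r.

    The gradient of the log-likelihood is expert_freq - pi. The Hessian of
    log-sum-exp has spectral norm at most 1/2, so for any step <= 2 the
    quadratic surrogate lies below the objective and touches it at r.
    Maximizing the surrogate therefore cannot decrease the true likelihood.
    """
    pi = softmax(r)
    return [x + step * (p - q) for x, p, q in zip(r, expert_freq, pi)]


def fit_reward(expert_freq, iters=200):
    """Run MM updates from a zero reward and record the likelihood each step."""
    r = [0.0] * len(expert_freq)
    history = [expert_log_likelihood(r, expert_freq)]
    for _ in range(iters):
        r = mm_step(r, expert_freq)
        history.append(expert_log_likelihood(r, expert_freq))
    return r, history


if __name__ == "__main__":
    # Hypothetical expert action frequencies for a three-action toy problem.
    reward, history = fit_reward([0.7, 0.2, 0.1])
    print("recovered policy:", softmax(reward))
    print("log-likelihood, first and last:", history[0], history[-1])
```

In this toy setting the recorded log-likelihood never decreases from one iteration to the next, and the induced policy approaches the expert frequencies. The recovered reward is only identified up to an additive constant, since adding a constant to every entry of r leaves the policy unchanged.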

Cite This Study

Chen et al. studied this question.

synapsesocial.com/papers/68f6196ee0bbbc94fac364dd https://doi.org/10.48550/arxiv.2509.23135
