What question did this study set out to answer?

This research aims to improve inverse reinforcement learning by developing a federated algorithm that can infer rewards from decentralized data.

May 22, 2026

Recovering Reward Functions From Distributed Expert Demonstrations via Bi-Level Maximum-Likelihood Optimization

Key Points

Proposed a federated maximum-likelihood IRL (F-ML-IRL) algorithm.
Leveraged dual aggregation for model updates and bi-level local updates for reward optimization and policy improvement.
Conducted evaluations on high-dimensional robotic control tasks in MuJoCo.
F-ML-IRL ensured convergence of the recovered reward in decentralized learning.
Recovered better rewards than all baselines in 12 out of 20 tasks, which the authors attribute to its ability to use distributed data.
Convergence analysis showed that the global model reaches a stationary point for both the reward and policy parameters within finite time.
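
The bi-level structure named in the key points can be written compactly. The block below is a sketch in assumed notation, not the paper's own statement: $\theta$ for the reward parameters, $\pi^{E}$ for the expert policy, $\gamma$ for the discount factor, and $\mathcal{H}$ for policy entropy are symbols introduced here for illustration. The upper level maximizes the discounted likelihood of expert trajectories, and the lower level defines the policy as optimal for the entropy-regularized discounted return under the current reward.

```latex
\begin{aligned}
\max_{\theta}\quad & L(\theta) = \mathbb{E}_{\tau \sim \pi^{E}}\Big[\sum_{t \ge 0} \gamma^{t} \log \pi_{\theta}(a_t \mid s_t)\Big] \\
\text{s.t.}\quad & \pi_{\theta} \in \arg\max_{\pi}\ \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t \ge 0} \gamma^{t}\big(r(s_t, a_t; \theta) + \mathcal{H}(\pi(\cdot \mid s_t))\big)\Big]
\end{aligned}
```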

Abstract

Inverse reinforcement learning (IRL) seeks to infer the latent reward function and the associated optimal policy from expert demonstrations. However, most current IRL methods assume centralized access to all trajectory data, which is impractical in real-world scenarios characterized by decentralized data sources and privacy concerns. To address this, this article proposes a novel algorithm for federated maximum-likelihood IRL (F-ML-IRL) and provides a rigorous analysis of its convergence rate. The proposed F-ML-IRL leverages dual aggregation to update the shared global model and performs bi-level local updates: an upper-level learning task that optimizes the parameterized reward function by maximizing the discounted likelihood of observing human expert trajectories under the current policy, and a lower-level learning task that finds the optimal agent policy with respect to the entropy-regularized discounted cumulative reward under the current reward function. We analyze the convergence rate of the proposed F-ML-IRL algorithm and show that the global model in F-ML-IRL converges to a stationary point for both the reward and policy parameters within finite time. That is, the log-distance between the recovered policy and the optimal policy, as well as the gradient of the likelihood objective, converges to zero. Evaluating our F-ML-IRL algorithm on high-dimensional robotic control tasks in MuJoCo, we show that it ensures convergence of the recovered reward in decentralized learning and outperforms centralized baselines due to its ability to utilize distributed data, attaining better recovered rewards than all baselines in 12 out of 20 tasks.
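
To make the update pattern in the abstract concrete, here is a minimal tabular sketch. It is an illustration under stated assumptions, not the authors' implementation. Every function name is hypothetical. The reward is one parameter per state-action pair, and the lower level is a single soft Bellman backup per step. The upper-level gradient is taken as expert minus policy discounted visitation, which is the standard maximum-likelihood IRL gradient for a tabular reward. "Dual aggregation" is assumed here to mean plain averaging of both the reward and the policy (Q) parameters across clients. Exact visitation measures stand in for sampled trajectories.

```python
import numpy as np

def softmax_policy(Q):
    z = np.exp(Q - Q.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def soft_backup(P, r, Q, gamma):
    # One soft Bellman backup with V(s) = log sum_a exp Q(s, a).
    m = Q.max(axis=1)
    V = m + np.log(np.exp(Q - m[:, None]).sum(axis=1))
    return r + gamma * (P @ V)

def visitation(P, pi, rho0, gamma, horizon=100):
    # Truncated discounted state-action visitation measure under pi.
    d, mu = rho0.copy(), np.zeros(pi.shape)
    for t in range(horizon):
        sa = d[:, None] * pi
        mu += gamma**t * sa
        d = np.einsum("sa,sat->t", sa, P)
    return mu

def local_update(P, rho0, mu_expert, theta, Q, gamma, lr, steps):
    theta, Q = theta.copy(), Q.copy()
    for _ in range(steps):
        # Lower level (assumed form): one policy-improvement step under the current reward.
        Q = soft_backup(P, theta, Q, gamma)
        pi = softmax_policy(Q)
        # Upper level: likelihood gradient for a tabular reward.
        grad = mu_expert - visitation(P, pi, rho0, gamma)
        theta += lr * grad
    return theta, Q

def federated_ml_irl(P, clients, gamma=0.9, lr=0.1, rounds=30, steps=5):
    S, A, _ = P.shape
    theta, Q = np.zeros((S, A)), np.zeros((S, A))
    for _ in range(rounds):
        results = [local_update(P, rho0, mu, theta, Q, gamma, lr, steps)
                   for rho0, mu in clients]
        # Assumed "dual aggregation": average reward and policy parameters.
        theta = np.mean([t for t, _ in results], axis=0)
        Q = np.mean([q for _, q in results], axis=0)
    return theta, softmax_policy(Q)

# Toy problem: a random MDP with a hidden reward and four clients, each
# holding expert data generated from its own initial-state distribution.
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))  # shape (S, A, S)
r_true = rng.normal(size=(S, A))
Q_star = np.zeros((S, A))
for _ in range(300):
    Q_star = soft_backup(P, r_true, Q_star, gamma)
pi_expert = softmax_policy(Q_star)
clients = []
for _ in range(4):
    rho0 = rng.dirichlet(np.ones(S))
    clients.append((rho0, visitation(P, pi_expert, rho0, gamma)))

theta, pi_hat = federated_ml_irl(P, clients, gamma=gamma)
```

In this sketch the clients share only `theta` and `Q` with the server, never their expert data, which is the property the federated setting is meant to preserve. A deep-RL version for MuJoCo would replace the tabular backup with a policy-gradient or actor-critic step and estimate the visitation terms from sampled trajectories.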


Cite This Study

Jiang et al. studied this question.

synapsesocial.com/papers/6a0ff2f5d674f7c03778b652 https://doi.org/10.1109/tnnls.2026.3688600
