What type of study is this?

This is a quantitative study (also classified as an experimental study).

October 20, 2025 · Open Access

Agentic Reinforcement Learning with Implicit Step Rewards

Key Points

iStar demonstrates superior performance compared to existing language models and strong RL baselines.
The method incorporates implicit step rewards that optimize the policy model, enhancing training stability.
iStar efficiently explores and achieves higher rewards in both step- and episode-level metrics.
This approach eliminates the need for explicit step labels, streamlining the reinforcement learning process.

Abstract

Large language models (LLMs) are increasingly trained with reinforcement learning as autonomous agents (agentic RL) that reason and act in interactive environments. However, sparse and sometimes unverifiable rewards make it extremely challenging to assign credit when training LLM agents that serve as a policy. Recent work attempts to integrate process supervision into RL but suffers from biased annotation, reward hacking, high variance from overly fine-grained rewards, or failures when state overlap is rare. We therefore introduce implicit step rewards for agentic RL (iStar), a general credit-assignment strategy that integrates seamlessly with standard RL algorithms without relying on additional rollouts or explicit step labels. Specifically, we alternately optimize an implicit process reward model (PRM) and the policy model, training the PRM with a trajectory-based DPO objective so that it generates implicit step rewards. Theoretical analysis shows that this learning objective produces a step-wise reward function. The implicit step rewards are then used to compute step-level advantages, which are combined with trajectory-level (episode-level) advantages for policy updates, creating a self-reinforcing training loop. We evaluate our method on three challenging agent benchmarks, including WebShop and VisualSokoban, as well as open-ended social interactions with unverifiable rewards in SOTOPIA. Crucially, iStar outperforms frontier LLMs and strong RL baselines across domains, achieving state-of-the-art results with higher sample efficiency and training stability. Further analysis also shows that iStar explores efficiently, increasing rewards at both the step and episode level while requiring fewer steps to achieve task success. Code will be available soon.
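
The abstract gives no formulas, so the following is only a minimal sketch of how the described pieces might fit together, not the authors' implementation. It assumes the log-ratio form of implicit rewards that is common in implicit-PRM work, r_t = beta * (log pi_phi(a_t | s_t) - log pi_ref(a_t | s_t)), a DPO-style loss on summed step rewards of a preferred versus a dispreferred trajectory, and group-normalized advantages. All function names, `beta`, and the mixing weight `alpha` are hypothetical.

```python
import math
from statistics import mean, pstdev


def implicit_step_rewards(logp_prm, logp_ref, beta=0.05):
    """Per-step implicit reward as a scaled log-ratio (assumed form)."""
    return [beta * (lp - lr) for lp, lr in zip(logp_prm, logp_ref)]


def trajectory_dpo_loss(rewards_chosen, rewards_rejected):
    """DPO-style loss on whole trajectories: -log sigmoid(margin).

    The margin is the difference between the summed implicit step
    rewards of the preferred and the dispreferred trajectory.
    """
    margin = sum(rewards_chosen) - sum(rewards_rejected)
    x = -margin
    # Numerically stable softplus(x) = log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def normalize(values, eps=1e-8):
    """Standardize values within a group (assumed normalization)."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / (sd + eps) for v in values]


def combined_advantages(episode_returns, step_rewards, alpha=1.0):
    """Add a normalized step-level term to each episode-level advantage.

    episode_returns: one scalar return per trajectory in the group.
    step_rewards: one list of implicit step rewards per trajectory.
    alpha: hypothetical weight on the step-level term.
    """
    ep_adv = normalize(episode_returns)
    # Normalize step rewards over all steps in the group.
    flat = [r for traj in step_rewards for r in traj]
    mu, sd = mean(flat), pstdev(flat)
    return [
        [ea + alpha * (r - mu) / (sd + 1e-8) for r in traj]
        for ea, traj in zip(ep_adv, step_rewards)
    ]


# Toy usage: two trajectories, the first one succeeded.
adv = combined_advantages([1.0, 0.0], [[0.2, 0.4], [-0.1]])
```

In this sketch every step in a trajectory shares the same episode-level advantage, and the step-level term separates more useful actions from less useful ones within it. How the two terms are actually weighted and normalized in iStar is not stated in the abstract.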
