What type of study is this?

This is an experimental study.

October 16, 2025 (Open Access)

AgentRM: Enhancing Agent Generalization with Reward Modeling

Key Points

AgentRM improves the base policy model's performance by 8.8 points on average across nine agent tasks, surpassing the top general agent by 4.0 points.
The approach demonstrates weak-to-strong generalization, achieving a 12.6 point improvement on the LLaMA-3-70B model.
Using test-time search techniques such as Best-of-N sampling and step-level beam search, the reward model guides the policy model's answer generation.
AgentRM also exhibits strong specialization, outperforming the top specialized agent by 11.4 points on three held-in tasks.
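
To make the Best-of-N idea concrete, here is a minimal sketch in Python. It is not the authors' implementation: `sample_trajectory` (the policy model) and `score_trajectory` (the reward model) are hypothetical callables that stand in for whatever models are actually used, and the default `n=8` is an arbitrary choice.

```python
from typing import Callable, Tuple


def best_of_n(
    task: str,
    sample_trajectory: Callable[[str], str],        # hypothetical policy model
    score_trajectory: Callable[[str, str], float],  # hypothetical reward model
    n: int = 8,                                     # arbitrary default
) -> Tuple[str, float]:
    """Sample n candidate answers and return the one the reward model scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Draw n independent candidates from the policy.
    candidates = [sample_trajectory(task) for _ in range(n)]
    # Score each complete candidate with the reward model.
    scored = [(score_trajectory(task, c), c) for c in candidates]
    # Keep the highest-scoring candidate (the first one wins on ties).
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score
```

The policy model is left unchanged in this setup. Only the selection step depends on the reward model, so the same scorer can in principle be paired with a different policy.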

Abstract

Existing LLM-based agents have achieved strong performance on held-in tasks, but their generalizability to unseen tasks remains poor. Hence, some recent work focuses on fine-tuning the policy model with more diverse tasks to improve generalizability. In this work, we find that fine-tuning a reward model to guide the policy model is more robust than directly fine-tuning the policy model. Based on this finding, we propose AgentRM, a generalizable reward model, to guide the policy model for effective test-time search. We comprehensively investigate three approaches to constructing the reward model: explicit reward modeling, implicit reward modeling, and LLM-as-a-judge. We then use AgentRM to guide answer generation with Best-of-N sampling and step-level beam search. Across nine agent tasks of four types, AgentRM improves the base policy model by 8.8 points on average, surpassing the top general agent by 4.0. Moreover, it demonstrates weak-to-strong generalization, yielding a greater improvement of 12.6 on the LLaMA-3-70B policy model. As for specializability, AgentRM can also boost a fine-tuned policy model and outperform the top specialized agent by 11.4 on three held-in tasks. Further analysis verifies its effectiveness in test-time scaling. Code will be released to facilitate research in this area.
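
The abstract names step-level beam search as one way a reward model can guide generation. A minimal sketch of that general technique follows. It is not the paper's code: `propose_steps`, `score_partial`, and `is_finished` are hypothetical callables, and the defaults for `beam_width`, `expansions`, and `max_steps` are arbitrary.

```python
from typing import Callable, List, Sequence, Tuple

Trajectory = Tuple[str, ...]  # a sequence of agent steps, assumed to be plain strings


def step_level_beam_search(
    task: str,
    propose_steps: Callable[[str, Trajectory, int], Sequence[str]],  # hypothetical policy
    score_partial: Callable[[str, Trajectory], float],  # hypothetical reward model
    is_finished: Callable[[Trajectory], bool],          # hypothetical stop check
    beam_width: int = 2,
    expansions: int = 3,
    max_steps: int = 5,
) -> Trajectory:
    """Grow trajectories one step at a time, keeping the top-scoring partial ones."""
    beams: List[Trajectory] = [()]
    for _ in range(max_steps):
        candidates: List[Trajectory] = []
        for traj in beams:
            if is_finished(traj):
                candidates.append(traj)  # carry finished trajectories forward unchanged
                continue
            # Ask the policy for several possible next steps.
            for step in propose_steps(task, traj, expansions):
                candidates.append(traj + (step,))
        if not candidates:
            break  # the policy proposed nothing, so keep the current beams
        # Rank partial trajectories with the reward model and prune to the beam width.
        candidates.sort(key=lambda t: score_partial(task, t), reverse=True)
        beams = candidates[:beam_width]
        if all(is_finished(t) for t in beams):
            break
    return max(beams, key=lambda t: score_partial(task, t))
```

Unlike Best-of-N, which scores only complete answers, this variant consults the reward model after every step, so weak partial trajectories are pruned early.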
