What type of study is this?

This is a quantitative study.

October 5, 2025 · Open Access

Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO

Key Points

Stepwise guided policy optimization consistently outperforms group relative policy optimization during training.
SGPO shows improved learning dynamics by addressing the all-negative-sample limitation in language models.
Empirical validation demonstrates SGPO's advantages across various model sizes in both offline and online training.
Diverse response generation within groups allows SGPO to learn effectively, even when incorrect responses are present.

Abstract

Reinforcement learning (RL) has proven effective in strengthening the reasoning capabilities of large language models (LLMs). A widely adopted method, Group Relative Policy Optimization (GRPO), has shown strong empirical results in training DeepSeek-R1. However, GRPO fails to update the policy when all responses within a group are incorrect (i.e., all-negative-sample groups). This limitation underscores a key gap between artificial and human intelligence: unlike humans, who can learn from mistakes, GRPO discards these signals. Our first contribution is to introduce a simple framework that mitigates the all-negative-sample issue by incorporating response diversity within groups using a step-wise judge model, which can be either directly trained or adapted from existing LLMs. We prove that this diversification can accelerate GRPO's learning dynamics in a simplified setting. We also empirically validate the proposed stepwise guided policy optimization (SGPO) method, demonstrating consistent gains across model sizes (7B, 14B, 32B) in offline and online training on 9 benchmarks, including base and distilled variants. Our results highlight two advantages: (i) SGPO surpasses GRPO, especially in the early and mid-training stages where all-negative-sample groups are prevalent; and (ii) SGPO does not require judge models to generate correct answers, differentiating it from knowledge distillation methods.
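The all-negative-sample issue described in the abstract can be illustrated with a small sketch. It assumes the commonly used group-normalized advantage (reward minus the group mean, divided by the group standard deviation). The `stepwise_reward` function and the judged step labels are hypothetical stand-ins for a step-wise judge, chosen only for illustration; they are not the paper's exact reward design.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def stepwise_reward(step_labels):
    """Hypothetical shaping: fraction of steps before the first step
    that a judge marks as incorrect (an assumption for illustration)."""
    if not step_labels:
        return 0.0
    correct_prefix = 0
    for ok in step_labels:
        if not ok:
            break
        correct_prefix += 1
    return correct_prefix / len(step_labels)


# A group of four responses that all reach a wrong final answer.
binary_rewards = [0.0, 0.0, 0.0, 0.0]
flat = group_relative_advantages(binary_rewards)
# Every advantage is zero, so this group contributes no policy update.

# Hypothetical per-step verdicts from a judge for the same four responses.
judged_steps = [
    [True, True, False, False],
    [False, False, False, False],
    [True, False, False, False],
    [True, True, True, False],
]
graded = [stepwise_reward(steps) for steps in judged_steps]
shaped = group_relative_advantages(graded)
# The shaped advantages now differ across responses, so the group
# carries a learning signal even though no final answer is correct.
```

In this sketch the judge only labels steps as correct or incorrect; it never has to produce a correct final answer itself, which matches point (ii) of the abstract.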

Cite This Study

Chen et al. studied this question.

synapsesocial.com/papers/68e24e6fd6d66a53c2473d84 https://doi.org/10.48550/arxiv.2505.11595
