What type of study is this?

This is an experimental study.

October 10, 2025 | Open Access

Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning

Key Points

SFPO outperforms GRPO by up to 2.80 points on average across math reasoning benchmarks.
To match GRPO's best accuracy, SFPO needs up to 4.93× fewer rollouts and a 4.19× reduction in wall-clock time.
The reposition-before-update design leaves the objective and rollout process unchanged, so SFPO is plug-compatible with existing policy-gradient pipelines.
Extensive experiments show that SFPO improves stability and accelerates convergence in RL training for LLM reasoning.

Abstract

Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). Yet on-policy algorithms such as Group Relative Policy Optimization (GRPO) often suffer in early training: noisy gradients from low-quality rollouts lead to unstable updates and inefficient exploration. We introduce Slow-Fast Policy Optimization (SFPO), a simple yet efficient framework that addresses these limitations by decomposing each step into three stages: a short fast trajectory of inner steps on the same batch, a reposition mechanism to control off-policy drift, and a final slow correction. This reposition-before-update design leaves the objective and rollout process unchanged, making SFPO plug-compatible with existing policy-gradient pipelines. Extensive experiments demonstrate that SFPO consistently improves stability, reduces rollouts, and accelerates convergence of reasoning RL training. Specifically, it outperforms GRPO by up to 2.80 points on average on math reasoning benchmarks. It also achieves up to 4.93× fewer rollouts and a 4.19× reduction in wall-clock time to match GRPO's best accuracy.
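
The three stages named in the abstract can be illustrated with a toy sketch. The abstract does not give the update rules, so everything concrete here is an assumption made for illustration: plain gradient descent for the fast inner steps, linear interpolation back toward the starting point as the reposition mechanism (controlled by a hypothetical `alpha`), and one further gradient step as the slow correction. The names `sfpo_style_step`, `quadratic_grad`, `inner_steps`, and `alpha` are hypothetical, and a quadratic loss stands in for the policy-gradient objective.

```python
from typing import Callable, List, Sequence

Vector = List[float]
GradFn = Callable[[Sequence[float], Sequence[float]], Vector]


def sfpo_style_step(
    theta: Sequence[float],
    grad_fn: GradFn,
    batch: Sequence[float],
    lr: float = 0.1,
    inner_steps: int = 3,
    alpha: float = 0.5,
) -> Vector:
    """One slow-fast step on a single batch (illustrative assumptions only)."""
    start = list(theta)

    # Stage 1 (fast): several gradient steps reusing the same batch.
    fast = list(start)
    for _ in range(inner_steps):
        grad = grad_fn(fast, batch)
        fast = [p - lr * g for p, g in zip(fast, grad)]

    # Stage 2 (reposition, ASSUMED form): pull the fast endpoint back toward
    # the starting point. alpha=0 discards the fast trajectory, alpha=1 keeps it.
    repositioned = [s + alpha * (f - s) for s, f in zip(start, fast)]

    # Stage 3 (slow correction, ASSUMED form): one more gradient step taken
    # from the repositioned point, still on the same batch.
    grad = grad_fn(repositioned, batch)
    return [p - lr * g for p, g in zip(repositioned, grad)]


def quadratic_grad(theta: Sequence[float], target: Sequence[float]) -> Vector:
    """Gradient of the toy loss 0.5 * ||theta - target||^2."""
    return [p - t for p, t in zip(theta, target)]


if __name__ == "__main__":
    new_theta = sfpo_style_step([0.0, 0.0], quadratic_grad, [1.0, -2.0])
    print(new_theta)
```

In this sketch, setting `alpha=0` reduces the step to a single ordinary gradient update, which is one way to see how a reposition-before-update scheme could sit inside an existing training loop without changing the objective or the rollout process.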
