What type of study is this?

This is an experimental study.

October 5, 2025 | Open Access

Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning

Key points

CAPO improves sample efficiency by up to 30x over standard GRPO on math reasoning benchmarks.
The algorithm tracks curvature information during policy updates and masks out samples that contribute to unstable updates.
CAPO keeps updates stable under aggressive learning regimes while rejecting fewer than 8% of tokens.
Theoretically, CAPO comes with monotonic improvement guarantees under realistic assumptions.

Abstract

Reinforcement Learning, particularly through policy gradient methods, has played a central role in enabling reasoning capabilities of Large Language Models. However, the optimization stability of policy gradients in this setting remains understudied. As a result, existing implementations often resort to conservative hyperparameter choices to ensure stability, which requires more training samples and increases computational costs. Hence, developing models for reliably tracking the underlying optimization dynamics and leveraging them into training enables more sample-efficient regimes and further unleashes scalable post-training. We address this gap by formalizing the stochastic optimization problem of policy gradients with explicit consideration of second-order geometry. We propose a tractable computational framework that tracks and leverages curvature information during policy updates. We further employ this framework to design interventions in the optimization process through data selection. The resultant algorithm, Curvature-Aware Policy Optimization (CAPO), identifies samples that contribute to unstable updates and masks them out. Theoretically, we establish monotonic improvement guarantees under realistic assumptions. On standard math reasoning benchmarks, we empirically show that CAPO ensures stable updates under aggressive learning regimes where baselines catastrophically fail. With minimal intervention (rejecting fewer than 8% of tokens), CAPO achieves up to 30x improvement in sample efficiency over standard GRPO for LLM reasoning.
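
The masking step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not say how the curvature information is computed or how samples are selected, so the per-token `instability_scores`, the `threshold`, and both function names are hypothetical stand-ins. The sketch only shows the general shape of the intervention, which is to drop flagged tokens from a policy gradient surrogate loss and report the rejected fraction.

```python
from typing import List, Sequence


def keep_mask(instability_scores: Sequence[float], threshold: float) -> List[bool]:
    """Return True for tokens that stay in the loss.

    `instability_scores` is a hypothetical per-token quantity (assumed to be
    derived from curvature information); higher means more destabilizing.
    """
    return [score <= threshold for score in instability_scores]


def masked_pg_loss(
    log_probs: Sequence[float],
    advantages: Sequence[float],
    keep: Sequence[bool],
) -> float:
    """REINFORCE-style surrogate loss averaged over the kept tokens only."""
    kept = [lp * adv for lp, adv, k in zip(log_probs, advantages, keep) if k]
    if not kept:
        return 0.0  # every token was rejected; nothing to optimize
    return -sum(kept) / len(kept)


def rejected_fraction(keep: Sequence[bool]) -> float:
    """Fraction of tokens masked out of the update."""
    return sum(1 for k in keep if not k) / len(keep)


# Toy batch of 25 tokens: one token has an outlying score and advantage.
scores = [0.1] * 24 + [5.0]
log_probs = [-1.0] * 25
advantages = [1.0] * 24 + [100.0]

mask = keep_mask(scores, threshold=1.0)
loss = masked_pg_loss(log_probs, advantages, mask)
frac = rejected_fraction(mask)
```

In this toy batch a single outlying token is rejected, so the rejected fraction is 1/25 and the loss is computed from the remaining 24 tokens. In a real training loop the scores would come from whatever curvature model is in use, and the mask would be applied to the token-level loss tensor rather than to Python lists.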

Cite This Study

Melo et al. (Wed,) studied this question.

synapsesocial.com/papers/68e2537cd6d66a53c24743c8 https://doi.org/10.48550/arxiv.2510.00819
