What type of study is this?

This is a Quantitative Study (also classified as: Experimental Study).

September 29, 2025 | Open Access

On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization

Key Points

The paper identifies Lazy Likelihood Displacement (LLD) in GRPO: during training, the likelihood of correct responses increases only marginally or even decreases.
LLD mirrors a misalignment issue recently observed in Direct Preference Optimization (DPO), which is attributed to the influence of negative gradients.
NTHR mitigates LLD by downweighting penalties on the tokens in incorrect responses that contribute to it.
Experiments on math reasoning benchmarks show consistent performance gains with NTHR across models from 0.5B to 3B parameters.

Abstract

Reinforcement learning (RL) has become popular in enhancing the reasoning capabilities of large language models (LLMs), with Group Relative Policy Optimization (GRPO) emerging as a widely used algorithm in recent systems. Despite GRPO's widespread adoption, we identify a previously unrecognized phenomenon we term Lazy Likelihood Displacement (LLD), wherein the likelihood of correct responses marginally increases or even decreases during training. This behavior mirrors a recently discovered misalignment issue in Direct Preference Optimization (DPO), attributed to the influence of negative gradients. We provide a theoretical analysis of GRPO's learning dynamic, identifying the source of LLD as the naive penalization of all tokens in incorrect responses with the same strength. To address this, we develop a method called NTHR, which downweights penalties on tokens contributing to the LLD. Unlike prior DPO-based approaches, NTHR takes advantage of GRPO's group-based structure, using correct responses as anchors to identify influential tokens. Experiments on math reasoning benchmarks demonstrate that NTHR effectively mitigates LLD, yielding consistent performance gains across models ranging from 0.5B to 3B parameters.
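The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a GRPO-style advantage that standardizes rewards within a sampled group, and it treats the per-token influence scores as a given input, because the abstract does not specify how NTHR computes them from the correct responses. The names `group_relative_advantages`, `token_weights`, `threshold`, and `eta` are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (GRPO-style).

    Assumption: population standard deviation with a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_weights(advantages, influence, threshold=0.5, eta=0.1):
    """Return one advantage value per token for every response in the group.

    Plain GRPO applies the same advantage to every token of a response.
    Here, tokens of negatively rewarded responses whose (hypothetical)
    influence score exceeds `threshold` have their penalty scaled by `eta`.
    """
    weighted = []
    for adv, scores in zip(advantages, influence):
        if adv >= 0:
            # Correct responses: keep the uniform per-token advantage.
            weighted.append([adv] * len(scores))
        else:
            # Incorrect responses: soften the penalty on flagged tokens only.
            weighted.append(
                [adv * eta if s > threshold else adv for s in scores]
            )
    return weighted


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]  # 1.0 = correct, 0.0 = incorrect
    # Hypothetical influence scores, one list per response, one score per token.
    influence = [[0.1, 0.2], [0.9, 0.1, 0.3], [0.2, 0.8], [0.0]]
    advantages = group_relative_advantages(rewards)
    for adv, row in zip(advantages, token_weights(advantages, influence)):
        print(round(adv, 3), [round(w, 3) for w in row])
```

Setting `eta=1.0` recovers the uniform penalty that the abstract identifies as the source of LLD, so the sketch makes the contrast between the two weightings easy to inspect.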

Cite This Study

Deng et al. studied this question.

synapsesocial.com/papers/68da58d8c1728099cfd11247 https://doi.org/10.48550/arxiv.2505.18830
