What type of study is this?

This is an experimental study.

October 5, 2025 · Open Access

Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers

Key Points

Correcting noisy binary rewards improves policy gradient estimation and yields better training outcomes than uncorrected training across models and datasets.
Both correction methods, backward and forward, are effective; the forward variant converges faster and remains stable under heavier noise.
Both methods are implemented as lightweight hooks in a group relative policy optimization (GRPO)-based RLVR pipeline.
A practical appeal mechanism, in which a lightweight LLM verifier rechecks rule-based negatives to estimate the false-negative (FN) rate online, outperforms other state-of-the-art contenders.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) trains policies against automated verifiers to avoid costly human labeling. To reduce vulnerability to verifier hacking, many RLVR systems collapse rewards to binary {0, 1} during training. This choice carries a cost: it introduces false negatives (FNs, rejecting correct answers) and false positives (FPs, accepting incorrect ones). For instance, a rule-based checker may mark the correct fraction 12/36 as wrong when compared against the canonical 1/3 because of brittle parsing or equivalence rules (an FN), while large language model (LLM) judges can be gamed by superficial cues or even a single adversarial token, yielding inflated correctness for wrong solutions (an FP). We formalize verifier unreliability by modeling the verifier as a stochastic reward channel with asymmetric noise rates. From this abstraction, we derive two correction algorithms for verifier errors. The first is a backward correction that de-biases the observed binary reward to recover an unbiased estimator of the clean policy gradient. The second is a forward correction that reweights score-function terms so that the expected update direction aligns with the clean gradient; notably, it requires only the FN rate. We implement both as lightweight hooks in a group relative policy optimization (GRPO)-based RLVR pipeline and evaluate them on math-reasoning models and benchmarks. Across models and datasets, both corrections improve over uncorrected training; the forward variant converges faster and remains stable under heavier noise. Finally, we show a practical appeal mechanism in which a lightweight LLM verifier estimates the FN rate online by rechecking rule-based negatives, outperforming other state-of-the-art contenders.
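
The backward correction described above can be illustrated with a minimal sketch. It assumes the standard de-biasing form from the noisy-label literature, which may differ in detail from the paper's estimator: if the verifier flips a clean reward of 1 to 0 with probability fn_rate and a clean reward of 0 to 1 with probability fp_rate, then the expected observed reward is fp_rate + clean * (1 - fp_rate - fn_rate), and inverting that affine map gives a corrected reward whose expectation equals the clean one. The function names, parameter names, and the simulated verifier are hypothetical and chosen only for illustration.

```python
import random


def backward_correct(observed, fp_rate, fn_rate):
    """De-bias a noisy binary reward (illustrative form, not the paper's code).

    Assumes E[observed | clean] = fp_rate + clean * (1 - fp_rate - fn_rate),
    so inverting this affine map yields an estimator with expectation `clean`.
    """
    denom = 1.0 - fp_rate - fn_rate
    if denom <= 0:
        raise ValueError("correction requires fp_rate + fn_rate < 1")
    return (observed - fp_rate) / denom


def noisy_verifier(clean, fp_rate, fn_rate, rng):
    """Simulate the stochastic reward channel with asymmetric flip rates."""
    if clean == 1:
        # A correct answer is rejected with probability fn_rate (false negative).
        return 0 if rng.random() < fn_rate else 1
    # An incorrect answer is accepted with probability fp_rate (false positive).
    return 1 if rng.random() < fp_rate else 0


def mean_corrected(clean, fp_rate, fn_rate, n=100_000, seed=0):
    """Monte Carlo average of the corrected reward for a fixed clean reward."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        observed = noisy_verifier(clean, fp_rate, fn_rate, rng)
        total += backward_correct(observed, fp_rate, fn_rate)
    return total / n


if __name__ == "__main__":
    # Hypothetical noise rates chosen for the demo only.
    for clean in (0, 1):
        print(clean, round(mean_corrected(clean, fp_rate=0.1, fn_rate=0.3), 3))
```

Individual corrected rewards fall outside [0, 1] (an observed 0 maps to a negative value), which is the price of unbiasedness: the correction trades extra variance for a mean that matches the clean reward. In a GRPO-style pipeline, a hook of this kind would presumably be applied to the observed rewards before group-relative advantages are computed, though the abstract does not specify where the hook sits.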

Cite This Study

Cai et al. (Wed,) studied this question.

synapsesocial.com/papers/68e25378d6d66a53c24742b0 https://doi.org/10.48550/arxiv.2510.00915
