What type of study is this?

This is a quantitative study.

October 5, 2025 (Open Access)

RESTRAIN: From Spurious Votes to Signals -- Self-Driven RL with Self-Penalization

Key Points

RESTRAIN enhances model reasoning by leveraging unlabeled data, avoiding reliance on gold labels.
On AIME25, RESTRAIN improves Pass@1 by up to +140.7%, showing its effectiveness on challenging reasoning tasks.
A self-penalization mechanism integrates into policy optimization methods such as GRPO, enabling continual self-improvement without supervision.
The approach points toward more scalable reinforcement learning that does not depend on labeled datasets.

Abstract

Reinforcement learning with human-annotated data has boosted chain-of-thought reasoning in large reasoning models, but these gains come at high costs in labeled data while faltering on harder tasks. A natural next step is experience-driven learning, where models improve without curated labels by adapting to unlabeled data. We introduce RESTRAIN (REinforcement learning with Self-restraint), a self-penalizing RL framework that converts the absence of gold labels into a useful learning signal. Instead of overcommitting to spurious majority votes, RESTRAIN exploits signals from the model's entire answer distribution: penalizing overconfident rollouts and low-consistency examples while preserving promising reasoning chains. The self-penalization mechanism integrates seamlessly into policy optimization methods such as GRPO, enabling continual self-improvement without supervision. On challenging reasoning benchmarks, RESTRAIN delivers large gains using only unlabeled data. With Qwen3-4B-Base and OctoThinker Hybrid-8B-Base, it improves Pass@1 by up to +140.7 percent on AIME25, +36.2 percent on MMLUSTEM, and +19.6 percent on GPQA-Diamond, nearly matching gold-label training while using no gold labels. These results demonstrate that RESTRAIN establishes a scalable path toward stronger reasoning without gold labels.
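
To make the idea in the abstract more concrete, here is a minimal sketch of how label-free, self-penalized advantages might be computed for one prompt. It is not the paper's formulation: the function names, the `min_consistency` threshold, the `penalty` value, and the vote-share reward are all assumptions chosen for illustration. The sketch covers two of the ideas mentioned above: using the whole answer distribution rather than a single hard majority vote, and penalizing every rollout of a low-consistency prompt. The normalization step mirrors the group-relative advantage used in GRPO-style training.

```python
from collections import Counter
from statistics import mean, pstdev


def vote_distribution(answers):
    """Return {answer: fraction of rollouts that produced that answer}."""
    counts = Counter(answers)
    total = len(answers)
    return {answer: count / total for answer, count in counts.items()}


def self_penalized_advantages(answers, min_consistency=0.5, penalty=-1.0):
    """Illustrative label-free advantages for one prompt's rollouts.

    `answers` holds the final answer extracted from each rollout.
    `min_consistency` and `penalty` are hypothetical hyperparameters,
    not values taken from the paper.
    """
    dist = vote_distribution(answers)
    consistency = max(dist.values())  # vote share of the most common answer

    # Low-consistency prompt: no answer is trustworthy as a pseudo-label,
    # so every rollout receives the same negative advantage.
    if consistency < min_consistency:
        return [penalty] * len(answers)

    # Soft pseudo-reward: each rollout is rewarded by the vote share of its
    # own answer, instead of 1/0 against a single majority answer.
    rewards = [dist[answer] for answer in answers]

    # Group-relative normalization (GRPO-style): subtract the group mean
    # and divide by the group standard deviation.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(answers)  # all rollouts agree, no learning signal
    return [(reward - mu) / sigma for reward in rewards]


if __name__ == "__main__":
    print(self_penalized_advantages(["42", "42", "42", "7"]))
    print(self_penalized_advantages(["1", "2", "3", "4"]))
```

In this sketch, rollouts that agree with a well-supported answer get a positive advantage, the outlier gets a negative one, and a prompt whose answers are scattered contributes only a penalty. The resulting per-rollout values would then be plugged into a policy-gradient update in place of advantages computed from gold-label rewards.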
