What type of study is this?

This is an experimental study.

October 16, 2025 | Open Access

Learning to Reason without External Rewards

Key Points

Intuitor matches GRPO's performance on mathematical benchmarks, demonstrating effective reasoning without external rewards.
Using self-certainty as a reward signal allows for unsupervised learning across various tasks without domain-specific supervision.
Experiments show that Intuitor generalizes better than GRPO to out-of-domain tasks such as code generation, without requiring gold solutions or test cases.
The findings support using intrinsic signals for developing scalable AI systems, especially when external rewards are scarce.

Abstract

Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at https://github.com/sunblaze-ucb/Intuitor
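The substitution the abstract describes, swapping GRPO's external reward for a self-certainty score, can be illustrated with a minimal sketch. The abstract does not give the formula, so this sketch assumes self-certainty is the average KL divergence from a uniform distribution over the vocabulary to the model's next-token distribution, and that advantages are the usual group-normalized rewards. The function names `self_certainty` and `group_relative_advantages` and the toy distributions are hypothetical and are not taken from the Intuitor repository.

```python
import math
from statistics import mean, pstdev


def self_certainty(token_dists):
    """Assumed definition: mean over token positions of KL(U || p).

    token_dists is a list of next-token distributions, one per generated
    token. Each distribution is a list of strictly positive probabilities.
    """
    total = 0.0
    for p in token_dists:
        v = len(p)
        # KL(U || p) = sum_j (1/V) * log((1/V) / p_j); zero when p is uniform.
        total += sum((1.0 / v) * math.log((1.0 / v) / pj) for pj in p)
    return total / len(token_dists)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps guards against division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group: three sampled responses to the same prompt, each two tokens
# long, over a four-word vocabulary (illustrative values only).
group = [
    [[0.25, 0.25, 0.25, 0.25], [0.40, 0.30, 0.20, 0.10]],  # hesitant
    [[0.70, 0.10, 0.10, 0.10], [0.60, 0.20, 0.10, 0.10]],  # fairly confident
    [[0.97, 0.01, 0.01, 0.01], [0.90, 0.05, 0.03, 0.02]],  # very confident
]

# The intrinsic score takes the place of the external, verifiable reward.
rewards = [self_certainty(dists) for dists in group]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Under these assumptions, responses whose token distributions are more peaked receive higher advantages than their group mates, so the policy update favors outputs the model is more certain about. No gold answers or test cases enter the computation.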
