What type of study is this?

This is a Quantitative Study (also classified as: Experimental Study).

October 20, 2025 | Open Access

RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning

Key Points

The Tango framework uses reinforcement learning to train a generator and a verifier concurrently, in an interleaved manner.
The generative, RL-trained verifier shows improved robustness and better generalization than deterministic or SFT-trained verifiers.
Among 7B/8B-scale models, the generator achieves best-in-class performance on five competition-level math benchmarks and four out-of-domain reasoning tasks.
Both components show especially large gains on the most difficult mathematical reasoning problems.

Abstract

Reinforcement learning (RL) has recently emerged as a compelling approach for enhancing the reasoning capabilities of large language models (LLMs), where an LLM generator serves as a policy guided by a verifier (reward model). However, current RL post-training methods for LLMs typically use verifiers that are fixed (rule-based or frozen pretrained) or trained discriminatively via supervised fine-tuning (SFT). Such designs are susceptible to reward hacking and generalize poorly beyond their training distributions. To overcome these limitations, we propose Tango, a novel framework that uses RL to concurrently train both an LLM generator and a verifier in an interleaved manner. A central innovation of Tango is its generative, process-level LLM verifier, which is trained via RL and co-evolves with the generator. Importantly, the verifier is trained solely based on outcome-level verification correctness rewards without requiring explicit process-level annotations. This generative RL-trained verifier exhibits improved robustness and superior generalization compared to deterministic or SFT-trained verifiers, fostering effective mutual reinforcement with the generator. Extensive experiments demonstrate that both components of Tango achieve state-of-the-art results among 7B/8B-scale models: the generator attains best-in-class performance across five competition-level math benchmarks and four challenging out-of-domain reasoning tasks, while the verifier leads on the ProcessBench dataset. Remarkably, both components exhibit particularly substantial improvements on the most difficult mathematical reasoning problems. Code is at: https://github.com/kaiwenzha/rl-tango.
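
As a rough illustration of two ideas in the abstract, the sketch below shows (1) an outcome-level reward for the verifier and (2) an interleaved update schedule. It is a minimal sketch under stated assumptions, not the authors' implementation: the `Verdict` type, the binary 0/1 reward, and the `gen_steps`/`ver_steps` parameters are hypothetical names and choices introduced here for illustration. The actual reward design and schedule are in the paper and the linked repository.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Verdict:
    """Hypothetical verifier output: per-step judgments plus a final call."""
    step_labels: List[bool]  # process-level judgments, one per reasoning step
    final_correct: bool      # verifier's overall judgment of the solution


def verifier_outcome_reward(verdict: Verdict, answer_is_correct: bool) -> float:
    """Assumed binary reward: 1.0 when the verifier's final judgment matches
    the ground-truth correctness of the generator's answer, else 0.0.
    The step-level labels are never compared against any annotation."""
    return 1.0 if verdict.final_correct == answer_is_correct else 0.0


def interleaved_schedule(rounds: int, gen_steps: int, ver_steps: int) -> List[str]:
    """Assumed schedule: in each round, run gen_steps generator updates
    followed by ver_steps verifier updates."""
    schedule: List[str] = []
    for _ in range(rounds):
        schedule += ["generator"] * gen_steps
        schedule += ["verifier"] * ver_steps
    return schedule
```

Note that the reward depends only on the verifier's final judgment, which is why no process-level annotations are needed: the step-level judgments are shaped indirectly by whether they lead to a correct overall verdict.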

Cite This Study

Zha et al. studied this question.

synapsesocial.com/papers/68f5a78aab63786de5b46171 https://doi.org/10.48550/arxiv.2505.15034
