What type of study is this?

This is a quantitative study.

October 2, 2025 · Open Access

wd1: Weighted Policy Optimization for Reasoning in Diffusion Language Models

Key Points

wd1 achieves up to 16% higher accuracy than existing RL methods for dLLMs on widely used reasoning benchmarks.
The method requires only a single likelihood approximation, for the current policy, instead of separate approximations for the current, old, and reference policies, which reduces computational overhead.
Experiments show wd1 outperforms existing RL methods for diffusion language models without needing supervised fine-tuning.
wd1 also reduces training time and the number of function evaluations (NFEs) per gradient step, making it a more efficient way to apply RL to reasoning in dLLMs.

Abstract

Improving the reasoning capabilities of diffusion-based large language models (dLLMs) through reinforcement learning (RL) remains an open problem. The intractability of the dLLM likelihood function necessitates approximating the current, old, and reference policy likelihoods at each policy optimization step. This reliance introduces additional computational overhead and can lead to large bias, particularly when approximation errors occur in the denominator of the policy ratios used for importance sampling. To mitigate these issues, we introduce wd1, a novel policy optimization approach that reformulates the objective as a weighted likelihood, requiring only a single approximation for the current parametrized policy likelihood. Experiments on widely used reasoning benchmarks demonstrate that wd1, without supervised fine-tuning (SFT) or any supervised data, outperforms existing RL methods for dLLMs, achieving up to 16% higher accuracy. wd1 also delivers computational gains, including reduced training time and a lower number of function evaluations (NFEs) per gradient step. These findings, combined with the simplicity of the method's implementation and its R1-Zero-like training (no SFT), position wd1 as a more effective and efficient method for applying RL to dLLM reasoning.
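
The weighted-likelihood idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the exact weighting, so the softmax over group-centered rewards, the `temperature` parameter, and all function names below are assumptions made for illustration. The point it shows is structural: the weights depend only on rewards, so no policy ratio appears, and the only likelihood that must be approximated is that of the current policy.

```python
import math


def group_advantages(rewards):
    """Center rewards within one group of sampled completions (assumed scheme)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_likelihood_loss(approx_logps, rewards, temperature=1.0):
    """Negative weighted log-likelihood for one group of completions.

    approx_logps: approximate log-likelihoods of each completion under the
        CURRENT policy (the single approximation the abstract refers to).
    rewards: scalar reward per completion.
    temperature: hypothetical knob controlling how sharply weights favor
        high-reward completions.
    """
    advantages = group_advantages(rewards)
    weights = softmax([a / temperature for a in advantages])
    # No old-policy or reference-policy likelihood appears anywhere here.
    return -sum(w * lp for w, lp in zip(weights, approx_logps))


if __name__ == "__main__":
    logps = [-1.0, -3.0]
    print(weighted_likelihood_loss(logps, rewards=[1.0, 1.0]))
    print(weighted_likelihood_loss(logps, rewards=[1.0, 0.0]))
```

With equal rewards the weights are uniform and the loss reduces to the average negative log-likelihood. When one completion earns a higher reward, its log-likelihood receives more weight, so a gradient step would raise its probability more.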
