What type of study is this?

This is a quantitative study.

October 5, 2025 · Open Access

Learning a Dense Reasoning Reward Model from Expert Demonstration via Inverse Reinforcement Learning

Key Points

Dense reasoning rewards provide step-level feedback to optimise reasoning policies during training.
Used as a learning signal, the dense reward elicits reasoning; used for reranking, it improves predictive performance, notably for Llama-based policies.
Expert demonstrations guide the learning process, prioritising correctness over mere appearance.
The work suggests reusable process-level rewards with broad potential to enhance multi-step reasoning.
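To make the first point concrete, the sketch below contrasts a dense, per-token reward with a sparse, outcome-only reward by computing the reward-to-go at each token. It is an illustration only, not the paper's training procedure: the function name `reward_to_go`, the discount `gamma`, and the example reward values are all assumptions introduced here.

```python
from typing import List


def reward_to_go(token_rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Return the discounted sum of rewards from each token to the end of the trace."""
    returns = [0.0] * len(token_rewards)
    running = 0.0
    # Walk backwards so each position accumulates everything that follows it.
    for t in range(len(token_rewards) - 1, -1, -1):
        running = token_rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical rewards for a four-token trace (values made up for illustration).
dense = [0.5, 0.25, -1.0, 0.25]   # a reward at every token; token 2 is penalised
sparse = [0.0, 0.0, 0.0, 1.0]     # a single outcome reward at the final token

dense_returns = reward_to_go(dense)
sparse_returns = reward_to_go(sparse)
```

With the sparse signal every token receives the same return, so the policy gets no indication of which step mattered. With the dense signal the returns differ per position, which is the kind of step-level feedback the key points describe.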

Abstract

We reframe and operationalise adversarial inverse reinforcement learning (IRL) for large language model reasoning, learning a dense, token-level reward model for process supervision directly from expert demonstrations rather than imitating style via supervised fine-tuning. The learned reasoning reward serves two complementary roles: (i) it provides step-level feedback to optimise a reasoning policy during training; and (ii) it functions at inference as a critic to rerank sampled traces under fixed compute budgets. We demonstrate that our approach prioritises correctness over surface form, yielding scores that correlate with eventual answer validity and enabling interpretable localisation of errors within a trace. Empirically, on GSM8K with Llama3 and Qwen2.5 backbones, we show that (i) dense reasoning rewards can be used as a learning signal to elicit reasoning, and (ii) reward-guided reranking improves predictive performance (notably for Llama-based policies). By unifying training signals, inference-time selection, and token-level diagnostics into a single reasoning reward, this work suggests reusable process-level rewards with broad potential to enhance multi-step reasoning in language models.
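The inference-time role described in the abstract, reranking sampled traces with the reward as a critic and localising errors within a trace, might look roughly like the following. This is a minimal sketch under stated assumptions: the abstract does not say how token-level rewards are aggregated into a trace score, so the mean used here is an assumption, and `trace_score`, `rerank`, `weakest_token`, and the reward values are hypothetical names and data.

```python
from typing import List, Tuple


def trace_score(token_rewards: List[float]) -> float:
    """Aggregate token-level rewards into one score (mean is an assumption)."""
    return sum(token_rewards) / len(token_rewards)


def rerank(candidates: List[Tuple[str, List[float]]]) -> List[Tuple[str, float]]:
    """Sort sampled traces from highest to lowest aggregate reward."""
    scored = [(text, trace_score(rewards)) for text, rewards in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def weakest_token(token_rewards: List[float]) -> int:
    """Index of the lowest-reward token, a crude pointer to a likely error."""
    return min(range(len(token_rewards)), key=lambda i: token_rewards[i])


# Two hypothetical sampled traces with made-up per-token rewards.
candidates = [
    ("trace A", [0.5, 0.5, -1.0]),
    ("trace B", [0.25, 0.5, 0.75]),
]

ranked = rerank(candidates)
best_text, best_score = ranked[0]
error_index = weakest_token(candidates[0][1])
```

The number of candidates is fixed before scoring, which corresponds to the fixed compute budget mentioned above: the reward model only selects among traces that were already sampled.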
