What question did this study set out to answer?

The dissertation aims to improve post-training for large reasoning models by analyzing reinforcement learning dynamics and adaptation strategies.

June 14, 2026 | Open Access

Reliable and efficient post-training of large reasoning models: reinforcement learning dynamics and adaptation

Key Points

Investigated Group Relative Policy Optimization and one of its failure modes, Lazy Likelihood Displacement, in which penalizing incorrect responses suppresses correct ones.
Developed strategies for stable training in multi-turn reinforcement learning environments.
Introduced and analyzed methods for efficient adaptation across the post-training pipeline, including data valuation, delta-parameter pruning, and prompt-based federated adaptation.
Identified the Lazy Likelihood Displacement issue and proposed negative token hidden reward as a selective mitigation strategy.
Proposed regularization methods to stabilize training in tool-integrated environments.
Demonstrated practical data valuation methods, along with adaptation strategies that let post-trained models be compressed, communicated, and personalized more effectively.

Abstract

Large reasoning models are increasingly improved through post-training rather than pre-training alone, but strong post-training requires more than simply maximizing task reward. This dissertation studies reliable and efficient post-training of large reasoning models, with a focus on reinforcement learning (RL) dynamics and efficient adaptation. The first part of the dissertation studies reliable post-training by understanding the learning dynamics of RL. It shows that Group Relative Policy Optimization can suffer from Lazy Likelihood Displacement, where penalizing incorrect responses suppresses correct ones, and introduces negative token hidden reward as a selective mitigation strategy. It then extends this analysis to tool-integrated environments, identifying a collapse mechanism in multi-turn RL and proposing regularization to stabilize training. The dissertation also introduces Token Hidden Reward as a token-level quantity for steering the exploration-exploitation trade-off, and provides a unified policy-gradient view connecting direct Pass@K optimization, advantage shaping, and related surrogate objectives. The second part studies efficient adaptation across the broader post-training pipeline. It develops practical methods for valuing training data in large generative models, including forward-only and similarity-based approaches that avoid repeated retraining or costly second-order approximations. It further studies efficient adaptation and deployment through delta-parameter pruning and prompt-based federated adaptation, showing how post-trained models can be compressed, communicated, and personalized more effectively. Taken together, the dissertation presents post-training as a joint problem over optimization dynamics, objective design, data quality, and deployment cost.
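As background for the Group Relative Policy Optimization discussion in the abstract, the following is a minimal sketch of the group-relative advantage as it is commonly described for GRPO: rewards for several responses to the same prompt are standardized within the group. The function name, the `eps` constant, the use of the population standard deviation, and the binary rewards are illustrative assumptions, not details taken from the dissertation. The sketch does not implement the Lazy Likelihood Displacement analysis or negative token hidden reward.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of responses to the same prompt.

    Hypothetical helper: responses scoring above the group mean receive a
    positive advantage, responses below the mean receive a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt: 1.0 = correct, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f} advantage={a:+.3f}")
```

In this toy group the incorrect responses receive negative advantages, so a policy-gradient update pushes their likelihood down. The abstract's claim is that this penalization can also suppress correct responses, which is the effect the proposed mitigation targets selectively.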
