Large reasoning models are increasingly improved through post-training rather than pre-training alone, but strong post-training requires more than simply maximizing task reward. This dissertation studies reliable and efficient post-training of large reasoning models, with a focus on reinforcement learning (RL) dynamics and efficient adaptation. The first part studies reliable post-training by analyzing the learning dynamics of RL. It shows that Group Relative Policy Optimization (GRPO) can suffer from Lazy Likelihood Displacement, where penalizing incorrect responses also suppresses the likelihood of correct ones, and introduces Negative Token Hidden Reward as a selective mitigation strategy. It then extends this analysis to tool-integrated environments, identifying a collapse mechanism in multi-turn RL and proposing regularization to stabilize training. The dissertation also introduces Token Hidden Reward as a token-level quantity for steering the exploration-exploitation trade-off, and provides a unified policy-gradient view connecting direct Pass@K optimization, advantage shaping, and related surrogate objectives. The second part studies efficient adaptation across the broader post-training pipeline. It develops practical methods for valuing training data in large generative models, including forward-only and similarity-based approaches that avoid repeated retraining or costly second-order approximations. It further addresses adaptation and deployment through delta-parameter pruning and prompt-based federated adaptation, showing how post-trained models can be compressed, communicated, and personalized more effectively. Taken together, the dissertation presents post-training as a joint problem over optimization dynamics, objective design, data quality, and deployment cost.
Wenlong Deng studied this question.