DHGRPO (Domain-Induced Hierarchical Group Relative Policy Optimization) is a mathematically grounded extension of Group Relative Policy Optimization (GRPO) that mitigates group-level failure modes in preference-based fine-tuning of large language models. The method integrates: (i) robust per-prompt normalization via median and median absolute deviation (MAD) to suppress outlier influence, (ii) a Domain-Induced Factor (DIF) for trust gating based on long-term reward stability, (iii) a Domain-Optimism Parameter (DOP) for recency-weighted learning emphasis, and (iv) a bounded reward amplifier with optional magnitude matching to preserve update scale. We present a stepwise derivation from the exact policy gradient to the GRPO surrogate and its DHGRPO refinement, a controlled simulation framework with hyperparameter sweeps demonstrating consistent proxy improvements, and actionable implementation recommendations for real-world deployment in large-scale preference optimization.
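To make component (i) concrete, the following is a minimal sketch of per-prompt median/MAD normalization next to the mean/standard-deviation normalization commonly used for GRPO group advantages. It is an illustration under stated assumptions, not the paper's implementation: the function names, the `eps` stabilizer, and the use of the 1.4826 consistency constant (which makes the MAD comparable to a standard deviation under Gaussian rewards) are choices made here. The DIF, DOP, and amplifier components are omitted because the abstract does not specify their form.

```python
import numpy as np

# Assumed constant: scales the MAD so it estimates the standard deviation
# when rewards are Gaussian. The paper may use a different scaling.
MAD_TO_STD = 1.4826


def mean_std_advantages(rewards, eps=1e-8):
    """Group-relative advantages using the group mean and standard deviation."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


def median_mad_advantages(rewards, eps=1e-8):
    """Group-relative advantages using the group median and MAD (hypothetical helper)."""
    r = np.asarray(rewards, dtype=float)
    med = np.median(r)
    mad = np.median(np.abs(r - med))  # median absolute deviation
    return (r - med) / (MAD_TO_STD * mad + eps)


if __name__ == "__main__":
    # One prompt, five sampled completions, one outlier reward.
    group = [1.0, 2.0, 3.0, 4.0, 100.0]
    print("mean/std  :", np.round(mean_std_advantages(group), 3))
    print("median/MAD:", np.round(median_mad_advantages(group), 3))
```

In this toy group the single outlier inflates the standard deviation, so the mean/std advantages of the four ordinary completions are squeezed close together, while the median/MAD version keeps them clearly separated. The outlier itself still receives a large advantage under median/MAD, which is presumably where a bounded amplifier such as component (iv) would come in.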
DeepSeek-AI et al. (Sat,) studied this question.