August 9, 2025 · Open Access

DHGRPO: Domain-Induced, Hierarchical Group Relative Policy Optimization

Abstract

DHGRPO (Domain-Induced Hierarchical Group Relative Policy Optimization) is a mathematically grounded extension of Group Relative Policy Optimization (GRPO) that mitigates group-level failure modes in preference-based fine-tuning of large language models. The method integrates: (i) robust per-prompt normalization via median and median absolute deviation (MAD) to suppress outlier influence, (ii) a Domain-Induced Factor (DIF) for trust gating based on long-term reward stability, (iii) a Domain-Optimism Parameter (DOP) for recency-weighted learning emphasis, and (iv) a bounded reward amplifier with optional magnitude matching to preserve update scale. We present a stepwise derivation from the exact policy gradient to the GRPO surrogate and its DHGRPO refinement, a controlled simulation framework with hyperparameter sweeps demonstrating consistent proxy improvements, and actionable implementation recommendations for real-world deployment in large-scale preference optimization.
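Component (i) of the abstract, per-prompt normalization via the median and MAD, can be illustrated with a minimal sketch. The code below is not the paper's implementation: the function names, the `eps` guard, and the 1.4826 consistency factor (the standard constant that makes the MAD comparable to a standard deviation under normality) are assumptions made for illustration. It contrasts the usual GRPO-style mean/std normalization with a median/MAD variant on one group of sampled rewards containing a single outlier.

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style group normalization: (r - mean) / std."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def robust_advantages(rewards, eps=1e-8, scale=1.4826):
    """Median/MAD normalization: (r - median) / (scale * MAD).

    `scale` and `eps` are illustrative assumptions, not values
    taken from the paper.
    """
    med = statistics.median(rewards)
    mad = statistics.median([abs(r - med) for r in rewards])
    return [(r - med) / (scale * mad + eps) for r in rewards]

# One prompt's group of rewards; the last sample is an outlier.
rewards = [1.0, 1.2, 0.9, 1.1, 10.0]

std_adv = grpo_advantages(rewards)
rob_adv = robust_advantages(rewards)

# Spread of the four non-outlier samples under each scheme.
std_spread = max(std_adv[:4]) - min(std_adv[:4])
rob_spread = max(rob_adv[:4]) - min(rob_adv[:4])

print("mean/std  :", [round(a, 3) for a in std_adv])
print("median/MAD:", [round(a, 3) for a in rob_adv])
print("inlier spread (mean/std, median/MAD):",
      round(std_spread, 3), round(rob_spread, 3))
```

With mean/std normalization the outlier inflates both the mean and the standard deviation, so the four ordinary samples collapse to nearly identical negative advantages. The median/MAD variant keeps their relative ordering well separated. In this sketch the outlier's own normalized score remains large; bounding it is a separate step that is not shown here.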

Cite This Study

DeepSeek-AI et al. (Sat,) studied this question.

synapsesocial.com/papers/6966674c59c617c4b57f771f
https://doi.org/10.5281/zenodo.16786368
