Reinforcement learning (RL), and in particular Proximal Policy Optimization (PPO), has shown promise in high-precision quadrotor unmanned aerial vehicle (QUAV) control. However, PPO's performance is highly sensitive to the choice of the clipping parameter: inappropriate settings can cause unstable training dynamics and excessive policy oscillations, which limit deployment in safety-critical aerial applications. To address this issue, we propose a stability-aware dynamic clipping parameter adjustment strategy that adapts the clipping threshold ϵ_t in real time based on a stability variance metric S_t. This adaptive mechanism balances exploration and stability throughout training. Furthermore, we provide a Lipschitz continuity interpretation of the clipping mechanism, showing that adapting ϵ_t implicitly adjusts a bound on the policy update step and thereby yields a deterministic guarantee on the oscillation magnitude. Extensive simulation results show that, compared with baseline PPO, the proposed method reduces policy variance by 45% and accelerates convergence, resulting in smoother control responses and improved robustness under dynamic operating conditions. Although developed within the PPO framework, the approach is readily applicable to other on-policy policy gradient methods.
Quan et al. studied this question.
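To make the dynamic clipping idea concrete, the following is a minimal Python sketch of PPO's clipped surrogate objective evaluated with a per-update threshold ϵ_t. The abstract does not specify how S_t is computed or how it maps to ϵ_t. The helpers `stability_variance` (variance of the batch's probability ratios) and `adaptive_epsilon` (an inverse mapping with gain `k` and bounds `eps_min`/`eps_max`) are therefore hypothetical illustrations, not the authors' method. Only `clipped_surrogate` follows the standard PPO formulation.

```python
from statistics import pvariance


def stability_variance(ratios):
    """HYPOTHETICAL stand-in for S_t: population variance of the
    importance ratios r = pi_new(a|s) / pi_old(a|s) in the current batch.
    The paper's actual definition of S_t may differ."""
    return pvariance(ratios)


def adaptive_epsilon(s_t, eps_base=0.2, k=5.0, eps_min=0.05, eps_max=0.3):
    """HYPOTHETICAL rule: shrink the clip range as instability (s_t) grows,
    then keep it inside [eps_min, eps_max]. All parameters are illustrative."""
    eps = eps_base / (1.0 + k * s_t)
    return min(max(eps, eps_min), eps_max)


def clipped_surrogate(ratios, advantages, eps):
    """Standard PPO clipped surrogate objective (to be maximised).

    Each ratio is clipped to [1 - eps, 1 + eps]. Taking the minimum of the
    unclipped and clipped terms means that once r leaves the clip range in
    the direction favoured by the advantage, that sample's term stops
    depending on r, which caps how far one update can push the policy.
    """
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped_r = min(max(r, 1.0 - eps), 1.0 + eps)
        total += min(r * a, clipped_r * a)
    return total / len(ratios)


if __name__ == "__main__":
    ratios = [0.9, 1.1, 1.4, 0.7]
    advantages = [1.0, -0.5, 2.0, 0.3]
    s_t = stability_variance(ratios)
    eps_t = adaptive_epsilon(s_t)
    print("S_t =", s_t, "eps_t =", eps_t)
    print("objective =", clipped_surrogate(ratios, advantages, eps_t))
```

In this sketch, a noisier batch (larger S_t) yields a tighter clip range, so the update step is bounded more conservatively. A calmer batch relaxes ϵ_t toward `eps_base`. The clamping to `[eps_min, eps_max]` is an assumed safeguard that keeps the threshold from collapsing to zero or growing without limit.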