Abstract: The highly coupled, underactuated, and nonlinear dynamics of quadrotors make it difficult for controllers designed through explicit modeling to deliver efficient and stable control in unknown dynamic environments. Reinforcement learning allows learning on the basis of the controlled object model, updating and optimizing the control strategy with data generated from interactions with the environment, and thus offers a new approach to this problem. However, conveying complex objectives to quadrotors is often challenging, because the reward function must be designed to carry sufficient information. Imitation learning can teach agents interactively from prior knowledge, but that prior knowledge is itself difficult to acquire. In this work, our goal is to bypass the design of reward functions and improve the generalizability of quadrotors across tasks. Specifically, we score the trajectories generated by quadrotors, learn a reward model from preferences between different trajectories, and use that model to train the quadrotors. We demonstrate that reward models fitted to trajectory preferences yield results consistent with directly defined reward functions, maintaining satisfactory learning rates and performance in both the “velocity control” and “hovering control” tasks.
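To make the idea of learning a reward model from trajectory preferences concrete, the following is a minimal sketch and not the authors' implementation. It assumes a linear per-step reward over state features and a Bradley-Terry preference model, neither of which is specified in the abstract; the function names (`trajectory_return`, `preference_prob`, `fit_reward_model`) and the hyperparameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def trajectory_return(w, traj):
    """Sum of predicted per-step rewards r(s) = w . s over a (T x d) trajectory.

    The linear reward form is an assumption made for this sketch.
    """
    return float(np.sum(traj @ w))


def preference_prob(w, traj_a, traj_b):
    """Bradley-Terry probability that traj_a is preferred over traj_b (assumed model)."""
    diff = trajectory_return(w, traj_a) - trajectory_return(w, traj_b)
    diff = np.clip(diff, -50.0, 50.0)  # keep exp() numerically safe
    return 1.0 / (1.0 + np.exp(-diff))


def fit_reward_model(pairs, labels, dim, lr=0.05, epochs=200):
    """Fit reward weights by gradient descent on the preference cross-entropy.

    pairs  : list of (traj_a, traj_b), each a (T x dim) array of state features
    labels : 1.0 if traj_a was preferred, 0.0 if traj_b was preferred
    """
    w = np.zeros(dim)
    for _ in range(epochs):
        grad = np.zeros(dim)
        for (a, b), y in zip(pairs, labels):
            p = preference_prob(w, a, b)
            # d(loss)/dw = (p - y) * (sum of features in a - sum of features in b)
            grad += (p - y) * (a.sum(axis=0) - b.sum(axis=0))
        w -= lr * grad / len(pairs)
    return w
```

Under these assumptions, the fitted weights define a reward signal `traj @ w` that could stand in for a hand-designed reward function when training a policy with a standard reinforcement learning algorithm. In practice the reward model would more likely be a neural network, and the preference labels would come from scoring real quadrotor trajectories rather than synthetic data.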
Shen et al. studied this question.