The history of learning for control has been an exciting back and forth between two broad classes of algorithms: planning and reinforcement learning. Planning algorithms effectively reason over long horizons, but assume access to a local policy and a distance metric over collision-free paths. Reinforcement learning excels at learning policies and the relative values of states, but fails to plan over long horizons. Despite the successes of each method in various domains, tasks that require reasoning over long horizons with limited feedback and high-dimensional observations remain exceedingly challenging for both planning and reinforcement learning algorithms. Frustratingly, these sorts of tasks are potentially the most useful, as they are simple to design (a human only needs to provide an example goal state) and avoid reward shaping, which can bias the agent towards finding a sub-optimal solution. We introduce a general control algorithm that combines the strengths of planning and reinforcement learning to effectively solve these tasks. Our aim is to decompose the task of reaching a distant goal state into a sequence of easier tasks, each of which corresponds to reaching a subgoal. Planning algorithms can automatically find these waypoints, but only if provided with suitable abstractions of the environment, namely a graph consisting of nodes and edges. Our main insight is that this graph can be constructed via reinforcement learning, where a goal-conditioned value function provides edge weights, and nodes are taken to be previously seen observations in a replay buffer. Using graph search over our replay buffer, we can automatically generate this sequence of subgoals, even in image-based environments. Our algorithm, search on the replay buffer (SoRB), enables agents to solve sparse reward tasks over one hundred steps, and generalizes substantially better than standard RL algorithms.
Eysenbach et al. (Wed,) studied this question.
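The graph construction described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes toy 2-D observations, uses a plain Euclidean distance as a stand-in for the distance that a learned goal-conditioned value function would provide, and introduces a hypothetical `max_dist` threshold to drop edges between observations that are far apart. The names `plan_subgoals`, `euclidean`, and `max_dist` are illustrative only.

```python
import heapq
import math
from typing import Callable, List, Sequence, Tuple

# Toy observation type (assumption); real observations could be images.
Obs = Tuple[float, float]


def euclidean(a: Obs, b: Obs) -> float:
    """Stand-in (assumption) for a distance derived from a learned
    goal-conditioned value function."""
    return math.dist(a, b)


def plan_subgoals(
    start: Obs,
    goal: Obs,
    buffer: Sequence[Obs],
    distance: Callable[[Obs, Obs], float],
    max_dist: float,
) -> List[Obs]:
    """Dijkstra search over buffer observations.

    Returns the waypoints from start to goal (goal included, start excluded),
    or an empty list if the goal is unreachable under the max_dist cutoff.
    """
    nodes = [start, *buffer, goal]
    src, dst = 0, len(nodes) - 1

    best = [math.inf] * len(nodes)   # best known cost to each node
    parent = [-1] * len(nodes)       # back-pointers for path recovery
    best[src] = 0.0
    frontier = [(0.0, src)]

    while frontier:
        cost, u = heapq.heappop(frontier)
        if cost > best[u]:
            continue  # stale queue entry
        if u == dst:
            break
        for v in range(len(nodes)):
            if v == u:
                continue
            w = distance(nodes[u], nodes[v])
            if w > max_dist:
                continue  # treat distant pairs as unconnected
            if cost + w < best[v]:
                best[v] = cost + w
                parent[v] = u
                heapq.heappush(frontier, (best[v], v))

    if math.isinf(best[dst]):
        return []

    # Walk back-pointers from the goal to recover the subgoal sequence.
    path: List[Obs] = []
    node = dst
    while node != src:
        path.append(nodes[node])
        node = parent[node]
    path.reverse()
    return path
```

In this sketch an agent would pursue the first returned waypoint with its goal-conditioned policy and replan as it moves. Edge weights are computed on demand for every pair of nodes, which is quadratic in the buffer size; a larger buffer would likely call for precomputing or sparsifying the graph.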