Reinforcement learning (RL) has long promised agents that work out complicated behaviour from nothing more than a basic reward for success. Getting simulated agents to show anything resembling genuine intelligence, however, is still very hard. This paper looks at the problem in two ways, comparing a single agent learning on its own with many agents learning with and against each other. The examples are a self-driving race car and a game of hide-and-seek. For the car, a simplified Rainbow DQN (a value-based method for training an agent) is used on a detailed computer model of the Jeddah Corniche race track. In hide-and-seek, multiple hiders and seekers learn to use objects (specifically, boxes) to their advantage in a 30 by 30 area with boxes that can be moved and walls that cannot, and they are trained with proximal policy optimization (PPO). The racing simulation runs at around 300,000 steps per second by computing the cars' 'eyes' (LiDAR), crashes, and checkpoint passes for up to 1000 cars at the same time. The 'Rainbow-lite' system (a double DQN combined with prioritized experience replay, three-step returns, and NoisyNet) helps the car learn efficiently without needing to try actions purely at random. The car is rewarded for getting closer to checkpoints and penalized for standing still and for spinning out, which in effect means it teaches itself in stages: starting from completely random behaviour, it learns to go fast, to take corners, and to avoid walls. In hide-and-seek, the hiders and seekers are trained together but act independently, and PPO with generalized advantage estimation drives some surprisingly clever behaviour.
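As an illustration of the generalized advantage estimation step mentioned above, here is a minimal pure-Python sketch. It is not the paper's implementation: the function name `gae`, the argument layout, and the default `gamma` and `lam` values are assumptions chosen for the example, since the abstract gives no hyperparameters.

```python
def gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one rollout (illustrative sketch).

    rewards, values, dones: per-step lists of equal length.
    last_value: value estimate for the state after the final step.
    gamma, lam: assumed defaults, not taken from the paper.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    # Walk backwards so each step can reuse the advantage of the step after it.
    for t in reversed(range(len(rewards))):
        mask = 0.0 if dones[t] else 1.0  # stop bootstrapping at episode ends
        delta = rewards[t] + gamma * next_value * mask - values[t]
        running = delta + gamma * lam * mask * running
        advantages[t] = running
        next_value = values[t]
    # Value targets for the critic are advantage plus the baseline value.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


if __name__ == "__main__":
    adv, ret = gae([1.0, 1.0], [0.0, 0.0], [False, True], last_value=5.0)
    print(adv, ret)
```

With `lam` close to 1 the estimate leans on observed rewards (lower bias, higher variance); with `lam` close to 0 it leans on the value estimates instead.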
The hiders work out how to shove boxes into an L-shaped shelter against a wall, and the seekers then learn to break these shelters down to reach them. Rewards based on visibility at each step and on winning at the end create an arms race in which each side keeps trying to outdo the other, and the game passes through six clear phases, from random movement at the start to teamwork on multi-box fortresses at the end. Both projects share some important technical features: the physics is computed with fast, batched (vectorized) NumPy calculations; the networks (the 'brains' of the agents) are stabilized with LayerNorm and orthogonal initialization; the target values are updated smoothly; and Pygame provides a visual display that makes it easier to see what is happening. Comparing the two shows that how the agents are rewarded and how rich the environment is matter more for producing intelligent behaviour than how complicated the underlying algorithm is. The experiments show that the agents reliably reach a solution, that the system can handle many agents, and that the learning can be applied elsewhere, with implications for self-driving cars, robots manipulating objects, and robots working together. The main point of this work is that complex behaviour can come from very simple reinforcement learning methods, provided they are carefully designed.
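One common way to update target values smoothly is a soft (Polyak) update, in which the target network's parameters move a small fraction toward the online network's parameters at every step. The abstract does not state the exact rule or rate, so the sketch below is an assumption: the function name `soft_update`, the dict-of-lists parameter layout, and the value of `tau` are all hypothetical.

```python
def soft_update(target, online, tau=0.005):
    """Return target parameters moved a fraction tau toward the online ones.

    target, online: dicts mapping a parameter name to a flat list of floats
    (a stand-in for real network weights; this layout is an assumption).
    tau: assumed blending rate, not taken from the paper.
    """
    return {
        name: [(1.0 - tau) * t + tau * o
               for t, o in zip(target[name], online[name])]
        for name in target
    }


if __name__ == "__main__":
    target = {"w": [0.0, 1.0]}
    online = {"w": [1.0, 1.0]}
    # Each call nudges the target a little closer to the online weights.
    for _ in range(3):
        target = soft_update(target, online, tau=0.1)
    print(target)
```

A small `tau` keeps the bootstrapped targets slow-moving, which is the usual motivation for preferring this over copying the online weights wholesale every few thousand steps.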
Dostonbek Abdurakhmonov
Dostonbek Abdurakhmonov studied this question.
synapsesocial.com/papers/69d895046c1944d70ce05f27 — DOI: https://doi.org/10.5281/zenodo.19451817