This study evaluates deep reinforcement learning (DRL) for local continuous-control decision-making in autonomous last-mile delivery simulation. A custom 2D environment, implemented in Pygame, was built from a map of Dublin city centre with lane-bounded drivable areas, zone-specific speed limits and multiple delivery locations. The agent observes a 16-dimensional state and executes continuous actions for acceleration/deceleration and steering. Global route guidance is provided by the A* search algorithm, while deliveries are scheduled using a nearest-first baseline. Four DRL algorithms were compared: Twin-Delayed Deep Deterministic Policy Gradient (TD3), Soft Actor–Critic (SAC), Deep Deterministic Policy Gradient (DDPG) and Proximal Policy Optimization (PPO), trained for up to two million time steps and evaluated with and without imitation learning (IL). Reward functions captured efficiency and safety, with penalties for collisions and stagnation. Across repeated trials of four deliveries per episode, IL accelerated learning and improved policy stability. TD3 with IL completed all deliveries in 96–100% of evaluation episodes, while SAC with IL achieved the highest reward. DDPG and PPO without IL failed to complete any deliveries. Overall, IL improved reward by 30–45% and removed stagnation, demonstrating that DRL with IL can enhance local-control performance, delivery completion and learning speed within an A*-guided realistic urban-map simulation.
Odekunle et al. studied this question.
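As an illustration of the nearest-first scheduling baseline mentioned above, the following is a minimal sketch and not the authors' implementation. It assumes that "nearest" is measured by straight-line (Euclidean) distance between 2D map coordinates; the study may instead rank deliveries by A* path length. The function name `nearest_first_order` and the example coordinates are hypothetical.

```python
import math


def nearest_first_order(start, deliveries):
    """Return deliveries in greedy nearest-first order.

    Assumption: distance is Euclidean between (x, y) points. A path-length
    metric (for example, from A*) could be swapped in for math.dist.
    """
    remaining = list(deliveries)  # copy so the caller's list is not mutated
    order = []
    current = start
    while remaining:
        # Pick the closest remaining delivery point from the current position.
        nearest = min(remaining, key=lambda p: math.dist(current, p))
        order.append(nearest)
        remaining.remove(nearest)
        current = nearest  # the next search starts from the point just served
    return order


if __name__ == "__main__":
    # Four hypothetical delivery locations, matching four deliveries per episode.
    depot = (0, 0)
    stops = [(5, 5), (1, 0), (2, 2), (9, 0)]
    print(nearest_first_order(depot, stops))
```

A greedy order of this kind is cheap to compute but not guaranteed to minimise total travel distance, which is consistent with its description as a baseline.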