June 18, 2026 | Open Access

Deep Reinforcement Learning-Based Autonomous Driving in Urban Environments

Key Points

This study evaluates how effectively deep reinforcement learning (DRL) algorithms handle autonomous driving for last-mile delivery.
Four DRL algorithms (TD3, SAC, DDPG and PPO) were compared in a custom 2D simulation built from a map of Dublin city centre.
The agent observes a 16-dimensional state and outputs continuous actions for acceleration/deceleration and steering.
Imitation learning (IL) was used to speed up training and improve policy stability.
TD3 with IL completed all deliveries in 96–100% of evaluation episodes, with stable performance.
SAC with IL achieved the highest reward.
DDPG and PPO without IL failed to complete any deliveries.

Abstract

This study evaluates deep reinforcement learning (DRL) for local continuous-control decision-making in autonomous last-mile delivery simulation. A custom 2D environment, implemented in Pygame, was built from a map of Dublin city centre with lane-bounded drivable areas, zone-specific speed limits and multiple delivery locations. The agent observes a 16-dimensional state and executes continuous actions for acceleration/deceleration and steering. Global route guidance is provided by the A* search algorithm, while deliveries are scheduled using a nearest-first baseline. Four DRL algorithms were compared: Twin-Delayed Deep Deterministic Policy Gradient (TD3), Soft Actor–Critic (SAC), Deep Deterministic Policy Gradient (DDPG) and Proximal Policy Optimization (PPO), trained for up to two million time steps and evaluated with and without imitation learning (IL). Reward functions captured efficiency and safety, with penalties for collisions and stagnation. Across repeated trials of four deliveries per episode, IL accelerated learning and improved policy stability. TD3 with IL completed all deliveries in 96–100% of evaluation episodes, while SAC with IL achieved the highest reward. DDPG and PPO without IL failed to complete any deliveries. Overall, IL improved reward by 30–45% and removed stagnation, demonstrating that DRL with IL can enhance local-control performance, delivery completion and learning speed within an A*-guided realistic urban-map simulation.
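The division of labour described in the abstract, with A* supplying the global route and a nearest-first rule choosing the next delivery, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the occupancy grid `GRID`, the 4-connected moves, the Manhattan heuristic, and the function names `astar` and `nearest_first_order` are hypothetical and are not taken from the study, whose environment is a continuous lane-bounded map rather than a grid.

```python
import heapq

# Hypothetical occupancy grid: 0 = drivable cell, 1 = blocked cell.
GRID = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 0, 0],
]


def astar(grid, start, goal):
    """Return a shortest 4-connected path from start to goal, or None."""
    def heuristic(cell):
        # Manhattan distance is admissible for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]
    came_from = {start: None}
    best_cost = {start: 0}
    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        if cost > best_cost[cell]:
            continue  # stale heap entry, a cheaper route was already found
        row, col = cell
        for nxt in ((row + 1, col), (row - 1, col), (row, col + 1), (row, col - 1)):
            r, c = nxt
            if 0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get(nxt, float("inf")):
                    best_cost[nxt] = new_cost
                    came_from[nxt] = cell
                    heapq.heappush(open_heap, (new_cost + heuristic(nxt), new_cost, nxt))
    return None


def nearest_first_order(grid, start, deliveries):
    """Greedily visit whichever remaining delivery has the shortest A* path."""
    order, current, remaining = [], start, list(deliveries)
    while remaining:
        paths = {d: astar(grid, current, d) for d in remaining}
        reachable = [d for d in remaining if paths[d] is not None]
        if not reachable:
            break  # the remaining deliveries cannot be reached
        nxt = min(reachable, key=lambda d: len(paths[d]))
        order.append(nxt)
        remaining.remove(nxt)
        current = nxt
    return order


if __name__ == "__main__":
    stops = [(2, 2), (0, 4), (3, 4), (2, 0)]  # four deliveries per episode
    print(nearest_first_order(GRID, (0, 0), stops))
```

In a setup like the one the abstract describes, the DRL policy would handle only the local acceleration and steering needed to follow each planned route; the planner and scheduler above stand in for the non-learned components.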
