September 16, 2025 | Open Access

An improved actor-critic architecture with PPO for the traveling salesman problem

Abstract

• We propose a novel Actor-Critic model with PPO for solving the TSP.
• Our adaptive scheduling method improves learning efficiency and stability.
• Our model outperforms baselines by 43-46% on small TSP instances.
• We scale our method to 1400+ cities, beyond the reach of prior RL methods.

The traveling salesman problem (TSP) is a classic NP-hard problem in combinatorial optimization with extensive practical applications. In this paper, we present an improved Actor-Critic architecture incorporating Proximal Policy Optimization (PPO) to solve the TSP effectively. We introduce adaptive temperature scheduling, comprehensive state representation, and layer normalization to enhance learning stability. Experimental results demonstrate that our Improved Actor-Critic approach achieves improvements ranging from 8.7% to 55.9% across problem sizes compared to established reinforcement learning baselines, including Q-Learning, SARSA, Double Q-Learning, Actor-Critic with Experience Replay (ACER), and Trust Region Policy Optimization (TRPO), with particularly strong performance on smaller instances of 20 to 100 cities. On standard TSPLIB benchmarks, our method shows consistent advantages of 12% to 33% compared to classical approaches. While tabular methods become computationally infeasible beyond 250 cities due to memory constraints, our approach maintains high solution quality for problems of up to 1432 cities on our experimental setup (Intel® Core™ i9-10900X CPU @ 3.70 GHz × 20 with four NVIDIA Quadro RTX 5000 GPUs). Our ablation studies confirm the importance of each component of the proposed architecture, with the improved state representation contributing most to model performance. This research significantly advances reinforcement learning approaches to combinatorial optimization, with practical implications for logistics, telecommunications, and manufacturing.
The developed source code is available at: https://github.com/LetuQingge/TSPEnvironment
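
As a rough illustration of two ingredients named in the abstract, the sketch below shows a temperature-scaled softmax over unvisited cities and the standard PPO clipped surrogate for a single sample. It is not the authors' implementation: the abstract does not specify the adaptive schedule, so `linear_temperature` is a hypothetical linear-decay stand-in, and all function names, parameters, and default values here are assumptions made for the example.

```python
import math


def masked_softmax(logits, visited, temperature):
    """Temperature-scaled softmax restricted to unvisited cities.

    Visited cities get probability 0. Assumes at least one city is unvisited.
    """
    scaled = [l / temperature for l, v in zip(logits, visited) if not v]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    it = iter(exps)
    return [0.0 if v else next(it) / total for v in visited]


def linear_temperature(step, total_steps, t_start=2.0, t_end=0.5):
    """Hypothetical schedule (an assumption, not the paper's adaptive rule):
    linear decay from t_start to t_end over total_steps."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return t_start + frac * (t_end - t_start)


def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for one sample:
    min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    logits = [1.0, 2.0, 3.0, 0.5]          # illustrative actor scores per city
    visited = [True, False, False, False]  # city 0 is already on the tour
    for step in (0, 50, 100):
        t = linear_temperature(step, 100)
        print(step, t, masked_softmax(logits, visited, t))
```

A high temperature early in training flattens the distribution over next cities and encourages exploration, while a lower temperature later concentrates probability on the highest-scoring city. The clipping in `ppo_clip_objective` removes the incentive to move the policy ratio outside the interval [1 - eps, 1 + eps] in a single update.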
