March 3, 2026 | Open Access

Reinforcement learning for multi-objective multi-echelon supply chain optimisation

Key Points

Reinforcement learning achieves better trade-offs among economic, environmental, and social objectives, enhancing overall stability.
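The study formulates the supply chain as a Markov decision process, giving a flexible model that handles non-stationary markets and varying network complexity.
Benchmarking against a weighted-sum single-objective RL method and an evolutionary algorithm shows that multi-objective reinforcement learning provides more balanced solutions in simulated scenarios that mimic real-world challenges.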
The results highlight significant improvements in operational robustness and production stability across various supply chain configurations.

Abstract

• Develops a Markov-based multi-objective model for supply chain optimisation.
• Introduces a customisable simulator for dynamic supply chain decision-making.
• Applies reinforcement learning to balance economic, environmental, and social goals.
• Finds that reinforcement learning achieves better trade-offs among objectives.
• Shows that reinforcement learning improves operational stability and balance.

This study develops a generalised multi-objective, multi-echelon supply chain optimisation model with non-stationary markets based on a Markov decision process, incorporating economic, environmental, and social considerations. The model is evaluated using a multi-objective reinforcement learning (RL) method, benchmarked against an originally single-objective RL algorithm modified with a weighted sum using predefined weights, and against a multi-objective evolutionary algorithm (MOEA)-based approach. We conduct experiments on varying network complexities, mimicking typical real-world challenges using a customisable simulator. The model determines production and delivery quantities across supply chain routes to achieve near-optimal trade-offs between competing objectives, approximating Pareto front sets.

The results demonstrate that the primary approach provides the most balanced trade-off between optimality, diversity, and density, further enhanced with a shared experience buffer that allows knowledge transfer among policies. In complex settings, it achieves approximately ten times higher hypervolume than the MOEA-based method and generates solutions that are twenty-six times denser, signifying better robustness, than those produced by the modified single-objective RL method. Moreover, it ensures stable production and inventory levels while minimising demand loss.
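The abstract compares methods by how well they approximate a Pareto front and by hypervolume. The following is a minimal sketch of those two notions plus weighted-sum scalarisation. It is not the authors' implementation. It assumes only two objectives, both maximised (the study uses three), and all function names, sample points, and weights are hypothetical, chosen for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # assumption: two objectives, both to be maximised


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and strictly better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: Sequence[Point]) -> List[Point]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def hypervolume_2d(points: Sequence[Point], ref: Point) -> float:
    """Area dominated by the front and bounded below by the reference point."""
    inside = [p for p in points if p[0] > ref[0] and p[1] > ref[1]]
    # Sorted by the first objective descending, the second objective
    # increases along a non-dominated front, so the area splits into strips.
    front = sorted(pareto_front(inside), key=lambda p: p[0], reverse=True)
    area, prev_y = 0.0, ref[1]
    for x, y in front:
        area += (x - ref[0]) * (y - prev_y)
        prev_y = y
    return area


def weighted_sum(point: Sequence[float], weights: Sequence[float]) -> float:
    """Collapse several objectives into one scalar with predefined weights."""
    return sum(w * v for w, v in zip(weights, point))


# Hypothetical objective vectors for six candidate policies.
candidates: List[Point] = [
    (1.0, 5.0), (2.0, 4.0), (3.0, 3.0), (4.0, 1.0), (2.0, 2.0), (1.0, 1.0),
]

front = pareto_front(candidates)
print(front)                                      # four non-dominated points
print(hypervolume_2d(candidates, (0.0, 0.0)))     # 13.0

# One fixed weight vector selects a single point on the front.
best = max(candidates, key=lambda p: weighted_sum(p, (0.8, 0.2)))
print(best)                                       # (4.0, 1.0)
```

In this toy data a single predefined weight vector returns one trade-off, whereas the non-dominated set exposes four. A larger hypervolume indicates a front that lies further from the reference point and covers the objective space more widely, which is the sense in which the abstract reports hypervolume ratios.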
