Production control optimization in process industries is often challenged by complex physicochemical processes that are difficult to model mathematically. As a model-free approach, Reinforcement Learning (RL) offers a promising solution. However, online action exploration through trial and error risks compromising equipment safety and efficiency, while offline training suffers from sampling bias due to limited and imbalanced datasets, particularly the scarcity of faulty operation data. To address these issues, this study proposes a world model-driven operational framework that integrates conditional diffusion with offline RL. By leveraging the distribution approximation capability of diffusion models, we introduce a conditional trajectory generation mechanism constrained by operational parameters and historical state transitions. This allows the diffusion model to produce near-realistic state trajectories and reward signals, constructing an interactive virtual state–action–reward space. We further employ autoregressive generation of imagined trajectories to support RL agent training. During world model training, a spatiotemporal Transformer architecture is incorporated to capture dependencies along state–action trajectories. For offline agent training, a Twin Delayed Deep Deterministic Policy Gradient (TD3)-based RL model regularized by behavior cloning is adopted. Experiments on a tobacco leaf-processing line demonstrate that the proposed conditional diffusion-based offline RL method accurately constructs a virtual sample space with a mean squared error of 1.27e−4, significantly reducing policy acquisition costs. The resulting RL-driven parameter adjustment achieves an approximately 12% improvement in the product qualification rate compared to other state-of-the-art offline RL algorithms. Our algorithm implementation and evaluation dataset are available at https://github.com/sizizuo0076/WM-PIO-ORL.
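The autoregressive generation of imagined trajectories described in the abstract can be illustrated with a minimal sketch. It assumes a world model that, given a fixed-length window of recent states and a candidate action, returns a predicted next state and reward. The names `imagine_rollout`, `toy_world_model`, and `toy_policy` are hypothetical, and the toy model is a simple deterministic stand-in for the paper's conditional diffusion sampler, not the authors' implementation.

```python
from collections import deque


def imagine_rollout(world_model, policy, init_states, horizon):
    """Roll a world model forward, feeding each prediction back as input."""
    # Fixed-length history window; the oldest state drops out on append.
    window = deque(init_states, maxlen=len(init_states))
    trajectory = []
    for _ in range(horizon):
        state = window[-1]
        action = policy(state)
        # The model is conditioned on the recent history and the action.
        next_state, reward = world_model(list(window), action)
        trajectory.append((state, action, reward, next_state))
        window.append(next_state)
    return trajectory


def toy_world_model(states, action):
    # Hypothetical stand-in for a conditional diffusion sampler (assumption).
    mean_state = sum(states) / len(states)
    next_state = 0.5 * mean_state + action
    reward = -abs(next_state - 1.0)  # reward peaks at a setpoint of 1.0
    return next_state, reward


def toy_policy(state):
    # Hypothetical proportional controller toward the setpoint 1.0.
    return 0.5 * (1.0 - state)


traj = imagine_rollout(toy_world_model, toy_policy, [0.0, 0.0], horizon=3)
```

Each tuple in `traj` is one imagined transition `(state, action, reward, next_state)`. In an offline RL setting, such transitions could supplement the logged dataset when training the agent.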
• A diffusion-based world model is proposed to guide offline decision agent training.
• A spatiotemporal Transformer is used for noise prediction in the diffusion model.
• Reinforcement learning with behavior cloning is designed for continuous control.
• The proposed framework is validated on a real tobacco shredding production line.
• The validation shows a 17.2% quality improvement for process production control.
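One common way to regularize a TD3 actor with behavior cloning is the TD3+BC objective, which subtracts a normalized Q-value term from a mean squared error between policy actions and dataset actions. The sketch below assumes that formulation with scalar actions; the paper's exact loss may differ. The function name, the default `alpha`, and the small epsilon added for numerical safety are assumptions made for illustration.

```python
def td3_bc_actor_loss(q_values, policy_actions, dataset_actions, alpha=2.5):
    """Actor loss in the TD3+BC style: -lambda * mean(Q) + MSE(pi(s), a)."""
    n = len(q_values)
    # Scale the Q term by the batch's mean |Q| so alpha sets the balance
    # between value maximization and imitation (epsilon is an assumption).
    mean_abs_q = sum(abs(q) for q in q_values) / n
    lam = alpha / (mean_abs_q + 1e-8)
    q_term = -lam * sum(q_values) / n
    # Behavior cloning term: keep policy actions close to logged actions.
    bc_term = sum(
        (p - a) ** 2 for p, a in zip(policy_actions, dataset_actions)
    ) / n
    return q_term + bc_term


# Toy batch: critic values, actions from the policy, actions from the dataset.
loss = td3_bc_actor_loss([1.0, 3.0], [0.0, 1.0], [0.0, 0.0], alpha=2.0)
```

A larger `alpha` weights the Q term more heavily, while a smaller one keeps the policy closer to the logged operating behavior.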
Yanlei Yin
Raymond Chiong
Chao Deng
Computers in Industry
Nanyang Technological University
University of Newcastle Australia
Kunming University of Science and Technology
Yin et al. studied this question.
www.synapsesocial.com/papers/69994aab873532290d01f1cb — DOI: https://doi.org/10.1016/j.compind.2026.104442