Stochastic-policy reinforcement learning (RL) algorithms are widely used in industrial control because of their strong exploration ability and high sample efficiency. However, these algorithms often produce large action fluctuations and noise, which makes them poorly suited to steady-state chemical processes. To address this problem, this study takes a thermal regeneration column (TRC) as the research object and selects the Soft Actor-Critic (SAC) algorithm as the baseline. Three strategies are introduced to improve SAC: an action-amplitude-constrained reward function, a low-pass filter, and a Kalman filter. Experimental results show that combining the action-amplitude-constrained reward function with the Kalman filter achieves the best performance. Compared with the standard SAC algorithm, the fluctuation amplitudes of steam consumption, cooling water consumption, sulfur concentration, and methanol makeup rate are reduced by 85.50%, 82.81%, 90.84%, and 85.49%, respectively. In addition, the fluctuation amplitude of the reward decreases by 90.68%. The method therefore both lowers operating costs and keeps the TRC operating stably.
Si et al. studied this question.
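The three smoothing strategies named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulations, so everything below is an assumption made for illustration: the quadratic action-change penalty and its `weight`, the first-order filter coefficient `alpha`, the random-walk Kalman model with variances `process_var` and `measurement_var`, and the synthetic noisy action signal standing in for SAC policy output.

```python
import random


def penalized_reward(base_reward, action, prev_action, weight=0.1):
    """Assumed form of an action-amplitude-constrained reward:
    subtract a penalty proportional to the squared change in action."""
    change = sum((a - p) ** 2 for a, p in zip(action, prev_action))
    return base_reward - weight * change


class LowPassFilter:
    """First-order low-pass filter: y_t = alpha * u_t + (1 - alpha) * y_{t-1}."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha
        self.state = None

    def step(self, u):
        if self.state is None:
            self.state = u  # initialize on the first sample
        else:
            self.state = self.alpha * u + (1 - self.alpha) * self.state
        return self.state


class ScalarKalmanFilter:
    """Scalar Kalman filter with a random-walk state model (an assumption),
    treating each raw action as a noisy measurement of a slowly varying setpoint."""

    def __init__(self, process_var=1e-3, measurement_var=1e-1):
        self.q = process_var
        self.r = measurement_var
        self.x = None  # state estimate
        self.p = 1.0   # estimate variance

    def step(self, z):
        if self.x is None:
            self.x = z  # initialize on the first measurement
            return self.x
        p_pred = self.p + self.q           # predict: variance grows by process noise
        k = p_pred / (p_pred + self.r)     # Kalman gain
        self.x = self.x + k * (z - self.x)  # update with the measurement residual
        self.p = (1 - k) * p_pred
        return self.x


def fluctuation_amplitude(xs):
    """Peak-to-peak range, used here as a simple fluctuation measure."""
    return max(xs) - min(xs)


# Synthetic stand-in for one channel of a stochastic policy's actions.
rng = random.Random(0)
raw = [0.5 + rng.gauss(0.0, 0.1) for _ in range(200)]

lpf, kf = LowPassFilter(), ScalarKalmanFilter()
low_passed = [lpf.step(u) for u in raw]
kalman_smoothed = [kf.step(u) for u in raw]

print("raw:", fluctuation_amplitude(raw))
print("low-pass:", fluctuation_amplitude(low_passed))
print("Kalman:", fluctuation_amplitude(kalman_smoothed))
```

In this sketch the reward penalty acts during training, discouraging the policy from making large moves, while the two filters act on the actions after the policy has produced them. This difference in where each strategy intervenes is one plausible reason a reward penalty and a filter would complement each other, though the sketch does not reproduce the reported results.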