ABSTRACT In reinforcement learning (RL), assuming a fixed control frequency often wastes computational resources and degrades policy performance, while traditional single‐step temporal difference (TD) learning suffers from accumulated bias in state‐value estimates. This paper proposes the multi‐state soft elastic actor‐critic (MSSEAC) algorithm to address both issues. First, it introduces a temporal consumption penalty mechanism and restructures the actor network's output into two branches that simultaneously generate control actions and time consumption estimates, enabling the agent to adjust its control frequency autonomously. Second, it develops the multi‐state temporal difference (MSTD) framework to address the limitations of conventional single‐step TD learning. Specifically, it proposes an experience replay buffer management strategy in which historical actions are used to stabilize learning during the initial training phases, with a gradual transition to policy‐generated actions in later stages to improve estimation accuracy. The multi‐state value estimation mitigates the bias accumulation inherent in single‐step TD methods through a weighted fusion of return distributions from multiple future states. Code is available at: https://github.com/asdwqqqq/MSSEAC.git
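As a rough illustration of what a weighted fusion over multiple future states can look like, the sketch below blends several n-step bootstrapped targets into one value target. The abstract does not give the MSTD formula, so this is an assumption rather than the authors' method: it uses scalar value estimates instead of return distributions, and the names `n_step_target`, `fused_target`, and the `weights` argument are hypothetical.

```python
from typing import Sequence


def n_step_target(rewards: Sequence[float], values: Sequence[float],
                  gamma: float, n: int) -> float:
    """n-step bootstrapped target.

    rewards[k] is the reward received k steps after time t.
    values[k] is the critic's estimate for the state reached after k + 1 steps.
    """
    discounted = sum((gamma ** k) * rewards[k] for k in range(n))
    return discounted + (gamma ** n) * values[n - 1]


def fused_target(rewards: Sequence[float], values: Sequence[float],
                 gamma: float, weights: Sequence[float]) -> float:
    """Weighted average of the 1-step through N-step targets.

    Assumption: fusion is a normalized weighted sum. The paper may weight
    or combine the future states differently.
    """
    if not (len(rewards) == len(values) == len(weights)):
        raise ValueError("rewards, values and weights must have equal length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    targets = [n_step_target(rewards, values, gamma, n)
               for n in range(1, len(weights) + 1)]
    return sum(w * g for w, g in zip(weights, targets)) / total


# Two future states, equally weighted.
target = fused_target([1.0, 1.0], [10.0, 20.0], gamma=0.5, weights=[1.0, 1.0])
```

Putting all the weight on the first entry recovers the ordinary single‐step TD target, so in this sketch the fused estimate is a generalization of single‐step TD, not a replacement for it.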
Gu et al. (Thu,) studied this question.