What type of study is this?

This is a quantitative study.

September 30, 2025 · Open Access

Revisiting Actor-Critic Methods in Discrete Action Off-Policy Reinforcement Learning

Key Points

Decoupling the actor and critic entropy terms brings DSAC performance to a level comparable to DQN.
The proposed framework allows an m-step Bellman operator for the critic update and combines standard policy optimization methods with entropy regularization for the actor objective.
The proposed methods provably converge to the optimal regularized value function in the tabular setting.
Empirically, the methods approach DQN performance on standard Atari games, even without entropy regularization or explicit exploration.

Abstract

Value-based approaches such as DQN are the default methods for off-policy reinforcement learning in discrete-action environments such as Atari. Common policy-based methods are either on-policy and do not learn effectively from off-policy data (e.g., PPO), or perform poorly in the discrete-action setting (e.g., SAC). Consequently, starting from discrete SAC (DSAC), we revisit the design of actor-critic methods in this setting. First, we determine that the coupling between the actor and critic entropy is the primary reason behind the poor performance of DSAC. We demonstrate that merely decoupling these components gives DSAC performance comparable to DQN. Motivated by this insight, we introduce a flexible off-policy actor-critic framework that subsumes DSAC as a special case. Our framework allows using an m-step Bellman operator for the critic update, and enables combining standard policy optimization methods with entropy regularization to instantiate the resulting actor objective. Theoretically, we prove that the proposed methods guarantee convergence to the optimal regularized value function in the tabular setting. Empirically, we demonstrate that these methods can approach the performance of DQN on standard Atari games, and do so even without entropy regularization or explicit exploration.
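
To make the ideas in the abstract concrete, here is a minimal tabular sketch. It is not the authors' implementation. It assumes one possible reading of "decoupling": the actor and the critic each get their own entropy coefficient (`tau_actor` and `tau_critic`, both hypothetical names). The critic applies an entropy-regularized policy-evaluation backup `m` times, and the actor is a softmax over the current Q-values. The toy MDP, function names, and hyperparameters are all illustrative assumptions.

```python
import numpy as np

def soft_policy(q, tau):
    """Softmax policy over actions; tau = 0 falls back to the greedy policy."""
    if tau <= 0:
        pi = np.zeros_like(q)
        pi[np.arange(q.shape[0]), q.argmax(axis=1)] = 1.0
        return pi
    z = (q - q.max(axis=1, keepdims=True)) / tau  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def entropy(pi):
    """Shannon entropy of each row of pi (one row per state)."""
    return -(pi * np.log(np.clip(pi, 1e-12, None))).sum(axis=1)

def m_step_backup(q, pi, P, R, gamma, tau_critic, m):
    """Apply the entropy-regularized policy-evaluation backup m times.

    P has shape (S, A, S), R and q have shape (S, A), pi has shape (S, A).
    """
    for _ in range(m):
        v = (pi * q).sum(axis=1) + tau_critic * entropy(pi)  # shape (S,)
        q = R + gamma * (P @ v)                              # shape (S, A)
    return q

def run(P, R, gamma=0.9, tau_actor=0.1, tau_critic=0.0, m=3, iters=200):
    """Alternate an actor step and an m-step critic step (assumed scheme)."""
    q = np.zeros_like(R)
    for _ in range(iters):
        pi = soft_policy(q, tau_actor)                        # actor update
        q = m_step_backup(q, pi, P, R, gamma, tau_critic, m)  # critic update
    return q, soft_policy(q, tau_actor)

# Hypothetical 2-state, 2-action deterministic MDP:
# action 0 always leads to state 0 (reward 0),
# action 1 always leads to state 1 (reward 1 from state 0, reward 2 from state 1).
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 0] = 1.0
P[0, 1, 1] = P[1, 1, 1] = 1.0
R = np.array([[0.0, 1.0], [0.0, 2.0]])

q, pi = run(P, R, tau_actor=0.1, tau_critic=0.0)
print(q)
print(pi)
```

Setting `tau_critic` equal to `tau_actor` ties the two entropy terms together in this sketch, while setting both to zero reduces it to a greedy, unregularized scheme with an m-step evaluation backup.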


Cite This Study

Asad et al. studied this question.

synapsesocial.com/papers/68dc12d38a7d58c25ebb10ec https://doi.org/10.48550/arxiv.2509.09838
