Multi-State TD Target for Model-Free Reinforcement Learning

Abstract

Temporal difference (TD) learning is a fundamental technique in reinforcement learning that updates value estimates for states or state-action pairs using a TD target. This target represents an improved estimate of the true value by incorporating both immediate rewards and the estimated value of subsequent states. Traditionally, TD learning relies on the value of a single subsequent state. We propose an enhanced multi-state TD (MSTD) target that utilizes the estimated values of multiple subsequent states. Building on this new MSTD concept, we develop complete actor-critic algorithms that include management of replay buffers in two modes, and integrate with deep deterministic policy gradient (DDPG) and soft actor-critic (SAC). Experimental results demonstrate that algorithms employing the MSTD target significantly improve learning performance compared to traditional methods.
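
To make the contrast in the abstract concrete, here is a minimal sketch of a single-state TD target next to one possible multi-state variant. The abstract does not state the exact form of the MSTD target, so the equal-weight average of k-step bootstrapped targets used below is an assumption for illustration, not the authors' definition. All function names and parameters are hypothetical.

```python
from typing import Sequence


def one_step_td_target(reward: float, next_value: float, gamma: float) -> float:
    """Standard single-state TD target: r_t + gamma * V(s_{t+1})."""
    return reward + gamma * next_value


def k_step_target(rewards: Sequence[float], bootstrap_value: float, gamma: float) -> float:
    """Discounted sum of k rewards plus the discounted value of the k-th subsequent state."""
    k = len(rewards)
    discounted_rewards = sum(gamma**i * r for i, r in enumerate(rewards))
    return discounted_rewards + gamma**k * bootstrap_value


def multi_state_td_target(
    rewards: Sequence[float], next_values: Sequence[float], gamma: float
) -> float:
    """ASSUMPTION: average the k-step targets for k = 1..n.

    rewards[i] is r_{t+i}; next_values[k - 1] is the estimated value of s_{t+k}.
    This is an illustrative stand-in, not the formula from the paper.
    """
    n = len(rewards)
    if n == 0 or len(next_values) != n:
        raise ValueError("rewards and next_values must be non-empty and of equal length")
    targets = [k_step_target(rewards[:k], next_values[k - 1], gamma) for k in range(1, n + 1)]
    return sum(targets) / n


if __name__ == "__main__":
    gamma = 0.5
    rewards = [1.0, 2.0]        # r_t, r_{t+1}
    next_values = [4.0, 8.0]    # V(s_{t+1}), V(s_{t+2})
    print(one_step_td_target(rewards[0], next_values[0], gamma))
    print(multi_state_td_target(rewards, next_values, gamma))
```

With a single subsequent state (n = 1), this multi-state sketch reduces to the one-step target, which matches the abstract's description of the traditional case.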

Cite This Study

Wang et al. studied this question.

synapsesocial.com/papers/68e685a5b6db64358760ee45 https://doi.org/10.48550/arxiv.2405.16522