The standard paradigm of reinforcement learning (RL) is the Markov Decision Process (MDP), in which an agent learns to maximize the cumulative discounted reward. The reward function in an MDP is generally defined as the sum of multiple reward components, each designed to capture a specific aspect of the desired policy. The discount factor γ ∈ [0, 1) reduces the present value of future rewards and thereby determines the agent's effective time horizon. In the conventional MDP, all reward components are subject to the same discount factor regardless of their specific meanings. Although this convenient configuration simplifies algorithm deployment, it sacrifices precision in defining the optimization problem and results in a temporal mismatch among rewards with diverse physical meanings. This paper proposes the multi-discounting MDP (MDMDP), a novel model based on reward decomposition, to address these problems. MDMDP allows practitioners to set a separate discount factor for each reward component, which provides great flexibility in combining reward components that operate at different timescales. Furthermore, this paper proposes an RL algorithm, multi-discounting Q-learning, to solve finite MDMDPs. We then extend it to deep RL, with multi-discounting DQN for tasks with discrete action spaces and multi-discounting actor-critic for tasks with continuous action spaces. Experimental results demonstrate that the proposed methods improve flexibility and precision in modeling complex tasks, enhancing the alignment of the agent's policy with the desired objectives.
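To make the idea of per-component discounting concrete, the following is a minimal sketch and not the authors' algorithm. It assumes the objective is the sum over components of each component's own discounted return, and it assumes a tabular update that keeps one Q-table per reward component and bootstraps from the action that is greedy with respect to the summed Q-values. The function names `multi_discounted_return` and `md_q_update`, the array layout, and the learning rate `alpha` are all hypothetical choices made for illustration.

```python
import numpy as np


def multi_discounted_return(rewards, gammas):
    """Sum over components i and steps t of gammas[i]**t * rewards[t, i].

    rewards: array-like of shape (T, K), one column per reward component.
    gammas:  array-like of length K, one discount factor per component.
    """
    rewards = np.asarray(rewards, dtype=float)
    gammas = np.asarray(gammas, dtype=float)
    steps = np.arange(rewards.shape[0])[:, None]  # shape (T, 1)
    discounts = gammas[None, :] ** steps          # shape (T, K)
    return float((discounts * rewards).sum())


def md_q_update(Q, s, a, r_vec, s_next, gammas, alpha=0.1, done=False):
    """One tabular update of K component Q-tables stored in Q[K, S, A].

    Assumption (not taken from the paper): the bootstrap action is the
    one that maximizes the sum of the component Q-values in s_next.
    Q is modified in place and also returned.
    """
    gammas = np.asarray(gammas, dtype=float)
    r_vec = np.asarray(r_vec, dtype=float)
    a_next = int(np.argmax(Q[:, s_next, :].sum(axis=0)))
    bootstrap = 0.0 if done else Q[:, s_next, a_next]
    target = r_vec + gammas * bootstrap           # one target per component
    Q[:, s, a] += alpha * (target - Q[:, s, a])
    return Q


if __name__ == "__main__":
    # Two components: a short-horizon one (0.5) and a long-horizon one (0.9).
    gammas = [0.5, 0.9]
    print(multi_discounted_return([[1, 1], [1, 1], [1, 1]], gammas))

    Q = np.zeros((2, 2, 2))                       # K=2, S=2, A=2
    md_q_update(Q, s=0, a=1, r_vec=[1.0, 2.0], s_next=1, gammas=gammas, alpha=0.5)
    print(Q[:, 0, 1])
```

In this sketch the agent would act greedily on `Q.sum(axis=0)`, so each component contributes at its own timescale; whether the paper combines the components in exactly this way is not stated in the abstract.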
Chen et al. (Mon,) studied this question.