A common class of reinforcement learning (RL) problems in nonstationary environments involves an environment model that switches among a finite set of distinct models. In some of these problems, predictions of abrupt model changes are available. The Weighted Mixture Policy (WMP) is a recent approach that uses such predictions proactively, before a change occurs, to increase the total accrued reward. However, the WMP approach presented in the literature assumes that the optimal policies of all individual environment models are known. This results in high sample complexity: enough training samples must first be collected to obtain the individual optimal policies, and only then can the WMP be used, which in turn requires new data for its own training. This paper investigates whether the WMP can be used before the optimal individual policies are obtained, with WMP training starting while the individual policies are still being trained. In a numerical experiment on the cart-pole predictive reference tracking problem, it is shown that optimal individual policies are not necessary for using the WMP, and that some improvement in the performance of the individual policies is sufficient to start using it. Using the WMP before the exact optimal policies are reached leads to a significant improvement in sample complexity.
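To make the idea of a weighted mixture concrete, here is a minimal Python sketch. It rests on assumptions that are not in the abstract: the mixture is modeled as a convex combination of the action distributions of two individual policies (one for the current model, one for the predicted next model), and the weight follows a hand-set linear ramp. The names `mixture_policy`, `linear_weight`, and `horizon` are hypothetical. In the paper the WMP is trained, so this fixed schedule only illustrates the mixing step and is not the authors' method.

```python
import numpy as np


def mixture_policy(probs_current, probs_next, weight):
    """Blend two action distributions with a convex combination.

    Assumption: both inputs are probability vectors over the same
    discrete action set, and `weight` is the share given to the policy
    of the predicted next environment model.
    """
    probs_current = np.asarray(probs_current, dtype=float)
    probs_next = np.asarray(probs_next, dtype=float)
    if probs_current.shape != probs_next.shape:
        raise ValueError("both policies must cover the same action set")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    mixed = (1.0 - weight) * probs_current + weight * probs_next
    # Renormalize to guard against floating-point drift.
    return mixed / mixed.sum()


def linear_weight(steps_until_change, horizon):
    """Hypothetical hand-set schedule for the mixture weight.

    Returns 0 when the predicted change is at least `horizon` steps
    away and rises linearly to 1 at the moment of the change.
    """
    if horizon <= 0:
        raise ValueError("horizon must be positive")
    return float(np.clip(1.0 - steps_until_change / horizon, 0.0, 1.0))


# Example: two (possibly still imperfect) individual policies over two actions.
pi_current = [0.9, 0.1]
pi_next = [0.2, 0.8]
w = linear_weight(steps_until_change=5, horizon=10)
action_probs = mixture_policy(pi_current, pi_next, w)
```

Nothing in the mixing step requires the individual policies to be optimal. It only needs their current action distributions, which is consistent with the abstract's claim that the WMP can be used while the individual policies are still being trained.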
Pourshamsaei et al. (Thu,) studied this question.