We investigate the decentralized nonparametric policy evaluation problem within reinforcement learning (RL), focusing on scenarios where multiple agents collaborate to learn the state-value function using sampled state transitions and privately observed rewards. Our approach centers on a regression-based multistage iteration technique employing infinite-dimensional gradient descent (GD) within a reproducing kernel Hilbert space (RKHS). To make computation and communication more feasible, we employ Nyström approximation to project this space onto a finite-dimensional subspace. We establish statistical error bounds to describe the convergence of value function estimation, marking the first instance of such analysis within a fully decentralized nonparametric framework. We compare the regression-based method to the kernel temporal difference (TD) method in numerical studies.
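The ingredients named in the abstract (Nyström features, regression-based multistage iteration, gradient descent, and agent-to-agent averaging) can be illustrated with a minimal sketch. This is not the paper's algorithm: the Gaussian kernel, the mixing matrix `W`, the step size, the iteration counts, and all function names below are assumptions chosen for illustration only.

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth=1.0):
    # Gaussian kernel matrix between rows of X and rows of Y (assumed kernel choice).
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def nystrom_features(X, landmarks, bandwidth=1.0, jitter=1e-8):
    # Nystrom map phi(x) = K_mm^{-1/2} k(Z, x): a finite-dimensional
    # surrogate for the RKHS spanned by the landmark points Z.
    K_mm = rbf_kernel(landmarks, landmarks, bandwidth)
    eigvals, eigvecs = np.linalg.eigh(K_mm + jitter * np.eye(len(landmarks)))
    inv_sqrt = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T
    return rbf_kernel(X, landmarks, bandwidth) @ inv_sqrt

def decentralized_policy_evaluation(phi_s, phi_next, rewards, W,
                                    gamma=0.9, stages=20, gd_steps=50, lr=0.5):
    # phi_s, phi_next: (n, m) features of sampled states and their successors.
    # rewards: (n_agents, n) privately observed rewards, one row per agent.
    # W: (n_agents, n_agents) doubly stochastic mixing matrix (hypothetical network).
    n_agents = rewards.shape[0]
    n, m = phi_s.shape
    theta = np.zeros((n_agents, m))
    for _ in range(stages):
        # Each stage freezes regression targets built from the current estimates.
        targets = rewards + gamma * (theta @ phi_next.T)
        for _ in range(gd_steps):
            residual = theta @ phi_s.T - targets      # (n_agents, n)
            grad = residual @ phi_s / n               # local least-squares gradient
            theta = W @ (theta - lr * grad)           # local GD step, then neighbor averaging
    return theta
```

Each agent's value estimate at a state `x` is then `nystrom_features(x, landmarks) @ theta[i]`. Only the `m`-dimensional coefficient vectors are exchanged between agents in this sketch, which is the practical motivation for reducing the RKHS to a finite-dimensional space.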
Liu et al. studied this question.