Key points are not available for this paper at this time.
Stochastic games model the strategic interactions between two or more players that occur in a sequence of stages. In this paper we focus on computing the average reward Nash equilibrium (ARNE) of a nonzero-sum stochastic game when the transition probabilities of the game and reward structure of the players are unknown. We note that the current state-of-the-art reinforcement learning (RL) algorithms that compute the ARNE of nonzero-sum stochastic games requires solving a matrix game corresponding to each state of the game at every iteration of the algorithm, which is PPAD Polynomial Parity Arguments on Directed graphs.PPAD-def-complete and incurs a memory complexity that is exponential in the number of players. In this paper, we use temporal difference error minimization and stochastic approximation to develop a scalable RL algorithm to compute an ARNE of nonzero-sum stochastic games. We prove the convergence of our algorithm to an ARNE. We evaluate the performance of our algorithm using an attacker-defender game modeled on a real-world ransomware dataset.
Sahabandu et al. (Tue,) studied this question.