We consider a subclass of \(n\)-player stochastic games, in which players have their own internal state/action spaces while they are coupled through their payoff functions. It is assumed that players' internal chains are driven by independent transition probabilities. Moreover, players can receive only realizations of their payoffs, not the actual functions, and cannot observe each other's states/actions. For this class of games, we first show that finding a stationary Nash equilibrium (NE) policy without any assumption on the reward functions is intractable. However, for general reward functions, we develop polynomial-time learning algorithms based on dual averaging and dual mirror descent, which converge in terms of the averaged Nikaido–Isoda distance to the set of \(\epsilon\)-NE policies almost surely or in expectation. In particular, under extra assumptions on the reward functions such as social concavity, we derive polynomial upper bounds on the number of iterates to achieve an \(\epsilon\)-NE policy with high probability. Finally, we evaluate the effectiveness of the proposed algorithms in learning \(\epsilon\)-NE policies using numerical experiments for energy management in smart grids.

Keywords: stochastic games, stationary Nash equilibrium, dual averaging, dual mirror descent, Nikaido–Isoda function, learning in games, smart grids

MSC codes: 91A26, 91A15, 93A16, 93A14, 93E35
S. Rasoul Etesami (Fri,) studied this question.
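
To make the two central notions in the abstract concrete, the following is a minimal sketch of dual averaging with an entropic regularizer, together with a Nikaido–Isoda gap used as the distance to equilibrium. It is not the paper's algorithm: it assumes a toy static two-player zero-sum matrix game with full payoff-vector feedback, whereas the paper treats \(n\)-player stochastic games in which players see only payoff realizations. The matrix `A`, the step size `eta`, the horizon `T`, and both function names are illustrative assumptions.

```python
import numpy as np


def softmax(scores):
    """Map dual scores to a point on the probability simplex (entropic mirror map)."""
    z = np.exp(scores - scores.max())
    return z / z.sum()


def nikaido_isoda_gap(A, x, y):
    """Sum of both players' best unilateral gains in the zero-sum game x^T A y.

    Player 1 maximizes x^T A y and player 2 minimizes it, so the two
    improvement terms add up to max_i (A y)_i - min_j (x^T A)_j.
    The gap is zero exactly at a Nash equilibrium.
    """
    return float((A @ y).max() - (x @ A).min())


def dual_averaging(A, T=20000, eta=0.01):
    """Run dual averaging for both players and return time-averaged strategies.

    Assumption: each player observes its full expected payoff vector,
    which is a simplification of the payoff-realization feedback
    described in the abstract.
    """
    m, n = A.shape
    score_x, score_y = np.zeros(m), np.zeros(n)  # accumulated payoff gradients
    x_sum, y_sum = np.zeros(m), np.zeros(n)
    for _ in range(T):
        x, y = softmax(score_x), softmax(score_y)
        x_sum += x
        y_sum += y
        score_x += eta * (A @ y)  # player 1 ascends its payoff gradient
        score_y -= eta * (x @ A)  # player 2 descends (it minimizes x^T A y)
    return x_sum / T, y_sum / T


# Hypothetical 2x2 game whose unique equilibrium is mixed.
A = np.array([[2.0, -1.0], [-1.0, 1.0]])
x_bar, y_bar = dual_averaging(A)
print("averaged strategies:", x_bar, y_bar)
print("Nikaido-Isoda gap:", nikaido_isoda_gap(A, x_bar, y_bar))
```

In this toy setting the gap of the averaged strategies equals the sum of the two players' average regrets, which is why averaging (rather than the last iterate) is what gets measured; a strategy pair with gap at most \(\epsilon\) is an \(\epsilon\)-NE.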