The Q-learning algorithm is well known for its convergence guarantees to the optimal policy in finite-state environments. In this paper, we investigate its limitations in countable infinite state spaces – a setting common in real-world problems. To this end, we introduce a simple queueing model, based on a load balancing problem, with a countably infinite state space. In this model, a dispatcher assigns incoming jobs to one of two queues by choosing between two possible actions: ''red'' and ''green''. The ''red'' action leads to transient behavior, whereas the ''green'' action ensures stability. Our main result shows that, under certain parameter conditions, Q-learning exhibits instability and fails to converge to the optimal policy. Our findings reveal a critical gap in the theoretical understanding of model-free Reinforcement Learning methods in infinite domains. Numerical experiments illustrate that the transience also occurs with decreasing stepsizes that satisfy the usual Robbins-Monro conditions.
Ayestaa et al. (Tue,) studied this question.