The growing penetration of variable renewable energy sources turns power grid dispatch into a high-dimensional, stochastic, multi-agent decision-making problem that challenges both classical optimization and standard Reinforcement Learning (RL) methods. Traditional RL policies, typically parameterized as unimodal Gaussians, lack the expressiveness to capture the multimodal action distributions that arise when multiple feasible dispatch strategies coexist, whereas diffusion-based generative policies gain expressiveness at the cost of prohibitively many iterative denoising steps at inference time. We propose Multi-Agent Flow-based Q-Learning (MAFQL), a framework that addresses this expressiveness–efficiency tradeoff by integrating conditional flow matching with conservative Q-learning under a Centralized Training with Decentralized Execution (CTDE) architecture. Its unified training pipeline combines four learning objectives: behavior cloning, flow matching, conservative Q-learning, and distillation. This design enables expressive policy generation with only 1–5 ODE integration steps. Measured per-agent inference latencies stay below 8 ms (P99) on both GPU and CPU hardware, which is compatible with the response requirements of automatic generation control. We formulate the dispatch task as a Dec-POMDP over three physically grounded control zones derived from the RTE network topology and evaluate MAFQL on the IEEE 118-bus and 14-bus systems in the Grid2Op simulator. Empirical results show that MAFQL CTDE substantially outperforms all tested baseline methods on the 118-bus system under a composite multi-objective reward function and demonstrates initial cross-scale generalizability on the 14-bus system. The decentralized execution variant consistently outperforms centralized execution, in line with the hypothesis that distillation facilitates effective knowledge transfer.
We conclude by discussing current limitations, such as the absence of ablation studies, end-to-end latency measurements, and formal safety guarantees, and outline directions for addressing them.
Te et al. (Sun,) studied this question.
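
To make the few-step sampling idea concrete, the following is a minimal sketch of how a flow-based policy could draw an action by integrating a learned velocity field with a handful of forward Euler steps. It is not the MAFQL implementation: the function names `sample_action` and `toy_velocity`, the NumPy setting, and the choice of Euler integration are all assumptions made for illustration, and `toy_velocity` is a hand-written stand-in for a trained network.

```python
import numpy as np


def sample_action(velocity_fn, obs, action_dim, n_steps=4, rng=None):
    """Integrate dx/dt = v(x, t, obs) from t = 0 to t = 1 with forward Euler.

    velocity_fn is assumed to be a trained flow-matching network; here it is
    any callable (x, t, obs) -> array with the same shape as x.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal(action_dim)  # noise sample at t = 0
    h = 1.0 / n_steps                    # uniform step size
    for k in range(n_steps):
        t = k * h                        # t never reaches 1 inside the loop
        x = x + h * velocity_fn(x, t, obs)
    return x


def toy_velocity(x, t, obs):
    """Hypothetical stand-in for a trained network.

    Returns the exact velocity of a straight-line path that ends at a fixed,
    observation-dependent target action (an assumption for this sketch).
    """
    target = np.tanh(obs[: x.shape[0]])
    return (target - x) / (1.0 - t)


obs = np.array([0.5, -1.0, 2.0])
action = sample_action(
    toy_velocity, obs, action_dim=2, n_steps=4, rng=np.random.default_rng(0)
)
```

With this idealized straight-line velocity field, the Euler iterate lands on the target action at the final step for any step count, which illustrates why flow-based sampling can need far fewer steps than iterative denoising. A learned field is only approximately straight, so the step count remains a tunable accuracy–latency parameter.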