What question did this study set out to answer?

The study aims to develop a decision-making framework for efficient power grid dispatch under high renewable energy penetration.

June 3, 2026 | Open Access

MAFQL: Multi-Agent Flow-Based Q-Learning for Efficient Power Grid Dispatch with High Renewable Penetration

Key Points

Developed the Multi-Agent Flow-based Q-Learning (MAFQL) framework, which combines four learning objectives: behavior cloning, flow matching, conservative Q-learning, and distillation.
Implemented a Centralized Training with Decentralized Execution (CTDE) architecture.
Formulated the dispatch task as a Dec-POMDP over three power grid control zones.
MAFQL CTDE substantially outperformed all tested baseline methods on the IEEE 118-bus system under a composite multi-objective reward.
Achieved per-agent inference latencies below 8 ms (P99) on both GPU and CPU hardware, compatible with the response requirements of automatic generation control.
Decentralized execution consistently outperformed centralized execution, consistent with the hypothesis that distillation facilitates effective knowledge transfer.

Abstract

The growing penetration of variable renewable energy sources transforms power grid dispatch into a high-dimensional, stochastic, multi-agent decision-making problem that challenges both classical optimization and standard Reinforcement Learning (RL) methods. Traditional RL policies, typically parameterized as unimodal Gaussians, lack the expressiveness to capture the multimodal action distributions that arise when multiple feasible dispatch strategies coexist, while diffusion-based generative policies achieve expressiveness at the cost of prohibitively many iterative denoising steps during inference. We propose Multi-Agent Flow-based Q-Learning (MAFQL), a framework that addresses this expressiveness–efficiency tradeoff by integrating conditional flow matching with conservative Q-learning under a Centralized Training with Decentralized Execution (CTDE) architecture. The framework consists of a unified training pipeline that combines four learning objectives: behavior cloning, flow matching, conservative Q-learning, and distillation. This allows expressive policy generation with only 1–5 ODE integration steps. Measured per-agent inference latencies are below 8 ms (P99) on both GPU and CPU hardware, which is compatible with the response requirements of automatic generation control. We formulate the dispatch task as a Dec-POMDP over three physically grounded control zones derived from the RTE network topology and evaluate MAFQL on the IEEE 118-bus and 14-bus systems in the Grid2Op simulator. Empirical results show that MAFQL CTDE substantially outperforms all tested baseline methods on the 118-bus system under a composite multi-objective reward function and demonstrates initial cross-scale generalizability on the 14-bus system. The decentralized execution variant consistently outperforms centralized execution, which is consistent with the hypothesis that distillation facilitates effective knowledge transfer. We close by discussing current limitations, including the absence of ablation studies, end-to-end latency measurements, and formal safety guarantees, and outline directions for addressing them.
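
To make the "1–5 ODE integration steps" claim concrete, the following is a minimal sketch of few-step sampling from a flow-based policy using forward Euler integration. It is not the authors' implementation: `toy_velocity`, `sample_action`, and the example setpoint are hypothetical names and values, and the closed-form velocity stands in for the learned, observation-conditioned velocity network that MAFQL would train with flow matching.

```python
import random


def toy_velocity(action, t, target):
    # Hypothetical stand-in for a learned velocity network v(a, t | obs).
    # For a straight-line path from noise to a single target point, the
    # exact velocity at time t < 1 is (target - a) / (1 - t).
    return [(g - a) / (1.0 - t) for a, g in zip(action, target)]


def sample_action(target, num_steps=4, seed=0):
    # Draw the initial action from Gaussian noise, then integrate
    # da/dt = v(a, t) from t = 0 to t = 1 with forward Euler steps.
    rng = random.Random(seed)
    action = [rng.gauss(0.0, 1.0) for _ in target]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # largest value is (num_steps - 1) / num_steps < 1
        velocity = toy_velocity(action, t, target)
        action = [a + dt * v for a, v in zip(action, velocity)]
    return action


if __name__ == "__main__":
    # Illustrative normalized dispatch adjustments (assumed values).
    setpoint = [0.3, -0.1, 0.8]
    print(sample_action(setpoint, num_steps=1))
    print(sample_action(setpoint, num_steps=5))
```

In this toy setting the path is exactly straight, so even a single Euler step reaches the target up to floating-point error. A learned velocity field is only approximately straight, which is one plausible reason a small number of steps (rather than one) would be used in practice, while still avoiding the long denoising chains of diffusion policies.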
