What type of study is this?

This is a Quantitative Study (also classified as: Experimental Study).

October 15, 2025
Open Access

Flow-Based Policy for Online Reinforcement Learning

Key Points

FlowRL aligns flow optimization with the reinforcement learning objective, enabling efficient, value-aware policy learning.
Empirical evaluations on DMControl and HumanoidBench show that FlowRL achieves competitive performance on online reinforcement learning benchmarks.
The framework utilizes a state-dependent velocity field to model policies, generating actions from noise.
By bounding the Wasserstein-2 distance, FlowRL keeps the flow policy close to a behavior-optimal policy implicitly derived from the replay buffer.
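The action-generation step in the key points above can be illustrated with a minimal sketch. It assumes forward Euler integration with a fixed number of steps and uses a hypothetical toy velocity field (`make_toy_velocity`) in place of a learned network. The function names, shapes, and step count are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def sample_action(velocity_fn, state, noise, num_steps=10):
    """Integrate da/dt = v(s, a, t) from t=0 to t=1 with forward Euler.

    The noise sample is the initial condition; the integration itself
    is deterministic given (state, noise).
    """
    action = noise.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        action = action + dt * velocity_fn(state, action, t)
    return action

def make_toy_velocity(weights):
    """Hypothetical velocity field (assumption): pulls the action
    toward a state-dependent target tanh(state @ weights)."""
    def velocity_fn(state, action, t):
        target = np.tanh(state @ weights)
        return target - action
    return velocity_fn

rng = np.random.default_rng(0)
state = rng.normal(size=4)          # toy 4-dimensional state
weights = rng.normal(size=(4, 2))   # toy parameters of the velocity field
noise = rng.normal(size=2)          # a_0 drawn from a standard Gaussian
velocity_fn = make_toy_velocity(weights)
action = sample_action(velocity_fn, state, noise, num_steps=10)
```

In a real agent the velocity field would be a neural network conditioned on the state, and the number of integration steps trades off sampling cost against integration accuracy.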

Abstract

We present FlowRL, a novel framework for online reinforcement learning that integrates flow-based policy representation with Wasserstein-2-regularized optimization. We argue that in addition to training signals, enhancing the expressiveness of the policy class is crucial for performance gains in RL. Flow-based generative models offer such potential, excelling at capturing complex, multimodal action distributions. However, their direct application in online RL is challenging due to a fundamental objective mismatch: standard flow training optimizes for static data imitation, while RL requires value-based policy optimization through a dynamic buffer, leading to difficult optimization landscapes. FlowRL first models policies via a state-dependent velocity field, generating actions through deterministic ODE integration from noise. We derive a constrained policy search objective that jointly maximizes Q through the flow policy while bounding the Wasserstein-2 distance to a behavior-optimal policy implicitly derived from the replay buffer. This formulation effectively aligns the flow optimization with the RL objective, enabling efficient and value-aware policy learning despite the complexity of the policy class. Empirical evaluations on DMControl and HumanoidBench demonstrate that FlowRL achieves competitive performance on online reinforcement learning benchmarks.
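The constrained policy search described in the abstract can be written schematically as follows. The notation is an assumption made for illustration and may differ from the paper's: $v_\theta$ is the velocity field, $\pi_\theta$ the induced flow policy, $\pi_{\beta^*}$ the behavior-optimal policy, $\mathcal{D}$ the replay buffer, and $\epsilon$ a hypothetical bound on the distance.

```latex
% Action generation: deterministic ODE integration from noise
a = a_0 + \int_0^1 v_\theta(s, a_t, t)\, dt, \qquad a_0 \sim \mathcal{N}(0, I)

% Constrained policy search (schematic form)
\max_{\theta}\; \mathbb{E}_{s \sim \mathcal{D},\; a \sim \pi_\theta(\cdot \mid s)} \big[ Q(s, a) \big]
\quad \text{s.t.} \quad
W_2\big( \pi_\theta(\cdot \mid s),\, \pi_{\beta^*}(\cdot \mid s) \big) \le \epsilon

% Standard definition of the Wasserstein-2 distance
W_2^2(\mu, \nu) = \inf_{\gamma \in \Gamma(\mu, \nu)} \mathbb{E}_{(x, y) \sim \gamma} \big[ \lVert x - y \rVert^2 \big]
```

Here $\Gamma(\mu, \nu)$ denotes the set of couplings whose marginals are $\mu$ and $\nu$. How the constraint is enforced in practice (for example, through a penalty term) is not specified in the abstract.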
