What does this research mean for the field?

An asynchronous batch inference framework integrating Vision-Language Model feedback into Reinforcement Learning enables lightweight models to achieve near-VLM performance in autonomous driving while sustaining real-time inference speeds of approximately 500 FPS. Novelty: methodological. Consensus alignment: neutral.

What question did this study set out to answer?

The aim is to improve reinforcement learning for autonomous driving by integrating foundation models, particularly Vision-Language Models, to enhance efficiency and semantic understanding.

May 17, 2026 | Open Access

Found-RL: Foundation model-enhanced reinforcement learning via asynchronous VLM feedback for autonomous driving

Key Points

Developed Found-RL, a platform utilizing asynchronous batch inference to integrate foundation models into RL workflows.
Implemented Value-Margin Regularization (VMR) and Advantage-Weighted Action Guidance (AWAG) to distill expert-like VLM action suggestions into the RL policy.
Used high-throughput CLIP for dense reward shaping, mitigating CLIP's dynamic blindness and probability dilution with Conditional Contrastive Action Alignment.
Showed that a lightweight RL model with millions of parameters can approach the performance of billion-parameter VLMs while running inference at about 500 FPS.
Decoupled heavy VLM reasoning from the simulation loop, so the policy can learn from VLM feedback in real time or near real time.
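
The asynchronous batch inference idea in the key points can be illustrated with a minimal Python sketch. This is not the Found-RL implementation: the class name AsyncBatchWorker, the parameters max_batch and max_wait_s, and the stand-in slow_fake_vlm function are all hypothetical and chosen only for illustration. The sketch shows the general pattern of a background thread that gathers requests from several environments, runs the slow model once per batch, and lets the simulation loop pick up results later without blocking.

```python
import queue
import threading
import time


class AsyncBatchWorker:
    """Answers requests in batches on a background thread (illustrative only)."""

    def __init__(self, model_fn, max_batch=8, max_wait_s=0.01):
        self.model_fn = model_fn      # hypothetical: list of observations -> list of feedback
        self.max_batch = max_batch    # assumed cap on batch size
        self.max_wait_s = max_wait_s  # assumed time budget for filling a batch
        self.requests = queue.Queue()
        self.results = queue.Queue()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, env_id, step, obs):
        """Enqueue a request and return immediately."""
        self.requests.put((env_id, step, obs))

    def poll(self):
        """Return every result that is ready right now, without blocking."""
        ready = []
        while True:
            try:
                ready.append(self.results.get_nowait())
            except queue.Empty:
                return ready

    def close(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        while not self._stop.is_set():
            batch = []
            try:
                batch.append(self.requests.get(timeout=self.max_wait_s))
            except queue.Empty:
                continue
            # Keep filling the batch until it is full or the time budget is spent.
            deadline = time.monotonic() + self.max_wait_s
            while len(batch) < self.max_batch:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self.requests.get(timeout=remaining))
                except queue.Empty:
                    break
            outputs = self.model_fn([obs for _, _, obs in batch])
            for (env_id, step, _), out in zip(batch, outputs):
                self.results.put((env_id, step, out))


def slow_fake_vlm(observations):
    """Stand-in for a heavy VLM call: one slow pass over the whole batch."""
    time.sleep(0.05)
    return [f"advice-for-{obs}" for obs in observations]


def run_demo(num_envs=4, num_steps=3):
    worker = AsyncBatchWorker(slow_fake_vlm, max_batch=4)
    feedback = {}
    for step in range(num_steps):
        for env_id in range(num_envs):
            # The simulation loop never waits for the model here.
            worker.submit(env_id, step, f"obs{env_id}-{step}")
        for env_id, s, out in worker.poll():
            feedback[(env_id, s)] = out
    # After the rollout, collect the feedback that arrived late.
    deadline = time.monotonic() + 5.0
    while len(feedback) < num_envs * num_steps and time.monotonic() < deadline:
        for env_id, s, out in worker.poll():
            feedback[(env_id, s)] = out
        time.sleep(0.01)
    worker.close()
    return feedback
```

Because each result is tagged with its environment and step, feedback that arrives a few steps late can still be matched to the transition that produced it. How the actual platform performs that matching is not described in this summary.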

Abstract

Reinforcement Learning (RL) has emerged as a dominant paradigm for end-to-end autonomous driving (AD) with real-time inference. However, RL typically suffers from sample inefficiency and a lack of semantic interpretability in complex scenarios. Foundation models, particularly Vision-Language Models (VLMs), can mitigate these limitations because they offer rich, context-aware knowledge. Yet deploying such computationally intensive models within high-frequency, multi-environment RL training loops is severely hindered by prohibitive inference latency and the absence of unified integration platforms. To bridge this gap, we present Found-RL, a platform that uses foundation models to enhance RL for AD efficiently. A core innovation of the proposed platform is its asynchronous batch inference framework, which decouples heavy VLM reasoning from the simulation loop. This design resolves the latency bottleneck and supports real-time or near-real-time learning from VLM feedback. Using the proposed platform, we introduce diverse supervision mechanisms to address domain-specific challenges. First, we implement Value-Margin Regularization (VMR) and Advantage-Weighted Action Guidance (AWAG) to distill expert-like VLM action suggestions into the RL policy. Second, for dense supervision, we adopt high-throughput CLIP for reward shaping. We mitigate CLIP’s dynamic blindness and probability dilution via Conditional Contrastive Action Alignment, which conditions prompts on discretized speed/command and yields a normalized, margin-based bonus from context-specific action-anchor scoring. Found-RL delivers an end-to-end pipeline for fine-tuned VLM integration with modular support, and shows that a lightweight RL model with millions of parameters can achieve performance close to that of billion-parameter VLMs while sustaining real-time inference (~500 FPS).
Code, data, and models will be publicly available at https://github.com/ys-qu/found-rl.
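
The abstract describes Conditional Contrastive Action Alignment only in words, so the following Python sketch is one possible reading and not the authors' formulation. Everything concrete in it is an assumption: the speed thresholds, the three action anchors, the prompt wording, the softmax temperature, and the choice of "probability of the taken action minus the best competing probability" as the margin. Toy vectors stand in for the embeddings that a CLIP image encoder and text encoder would produce.

```python
import math

# Hypothetical speed bins (lower threshold in m/s, label) and action anchors.
SPEED_BINS = [(0.0, "stopped"), (2.0, "slow"), (8.0, "fast")]
ACTIONS = ["accelerate", "keep speed", "brake"]


def speed_bin(speed_mps):
    """Map a continuous speed to a coarse label (assumed discretization)."""
    label = SPEED_BINS[0][1]
    for threshold, name in SPEED_BINS:
        if speed_mps >= threshold:
            label = name
    return label


def build_prompts(speed_mps, command):
    """One text prompt per action anchor, conditioned on speed bin and command."""
    context = f"the car is {speed_bin(speed_mps)} and must {command}"
    return {a: f"{context}; the right action is to {a}" for a in ACTIONS}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def margin_bonus(image_emb, anchor_embs, taken_action, temperature=0.1, scale=1.0):
    """Normalized, margin-based bonus over the anchors of a single context.

    anchor_embs maps each action to the embedding of its conditioned prompt.
    The result lies strictly between -scale and +scale.
    """
    logits = {a: cosine(image_emb, e) / temperature for a, e in anchor_embs.items()}
    top = max(logits.values())
    exps = {a: math.exp(x - top) for a, x in logits.items()}  # stable softmax
    total = sum(exps.values())
    probs = {a: x / total for a, x in exps.items()}
    best_other = max(p for a, p in probs.items() if a != taken_action)
    return scale * (probs[taken_action] - best_other)


# Toy 2-D embeddings standing in for CLIP outputs.
prompts = build_prompts(5.0, "turn left")
toy_anchor_embs = {
    "accelerate": [1.0, 0.0],
    "keep speed": [0.0, 1.0],
    "brake": [-1.0, 0.0],
}
toy_image_emb = [1.0, 0.0]
bonus_good = margin_bonus(toy_image_emb, toy_anchor_embs, "accelerate")
bonus_bad = margin_bonus(toy_image_emb, toy_anchor_embs, "brake")
```

In this reading, the softmax runs over only the few anchors built for the current speed and command, so probability mass is not spread across prompts that cannot apply in that context. Whether this matches how the paper handles probability dilution would need to be checked against the full text.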
