March 3, 2026 · Open Access

Stackelberg Game-Theoretic Safe MARL With Bilevel Control for Autonomous Driving

Key Points

DiStaK-C51 achieves improved safety metrics while maintaining high task performance and stable learning dynamics.
Evaluation on standard two-vehicle merge and roundabout benchmarks shows substantially improved safety metrics in interactive driving scenarios.
The implementation uses a C51-based cost head to estimate the full distribution of safety costs and builds chance-constrained admissible action sets via CDF thresholding, supporting decision-making under uncertainty.
The theoretical analysis shows that the (fallback-augmented) screened safe Stackelberg Bellman operator is a contraction, with an explicit ϵ-neighborhood bound under Top-K shortlisting and distributional projection.

Abstract

Ensuring safety in interactive autonomous driving remains a core challenge for reinforcement learning (RL), since agents must act under uncertainty and rare but critical events (e.g., collisions) while respecting traffic rules such as yielding and right-of-way. To address these challenges, we propose DiStaK, a distributional Stackelberg RL framework that models driving interactions as a bilevel leader–follower game. A practical discrete instantiation, DiStaK-C51, augments the safety layer with a C51-based cost head to estimate the full distribution of safety costs and constructs chance-constrained admissible action sets by thresholding the cumulative distribution function (CDF). To improve efficiency, DiStaK-C51 replaces exhaustive joint-action Stackelberg enumeration with a retriever–refiner Top-K′/Top-K selection rule: a lightweight retriever produces a small candidate list, chance-constraint screening filters unsafe actions, and a final Top-K shortlist supports critic-based refinement. The follower selects a risk-aware best response using Q₂ − λ₂C₂ with an adaptive dual update on λ₂, and leader actions can be screened based on the induced interaction outcome, with a relaxation fallback to avoid deadlock when estimated safe sets are empty. We evaluate DiStaK-C51 on standard two-vehicle merge and roundabout benchmarks, where it achieves substantially improved safety metrics while maintaining strong task performance and stable learning dynamics. We also provide theoretical analysis showing that the (fallback-augmented) screened safe Stackelberg Bellman operator is a contraction and that Top-K shortlisting and distributional projection yield an explicit ϵ-neighborhood bound. Finally, we outline a practical extension to multi-vehicle traffic via rule-based role assignment and a horizontal two-level Stackelberg expansion, while comprehensive multi-vehicle evaluation is left for future work.
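
The selection pipeline described in the abstract (retriever candidates, chance-constraint screening by CDF thresholding, Top-K shortlist, critic refinement, and the follower's penalized best response with a dual update) can be illustrated with a minimal Python sketch. This is not the authors' implementation. All function names, the cost threshold, the risk level `delta`, the specific fallback rule (keep the candidate with the highest estimated probability of staying under the cost threshold), and the projected dual-ascent form of the multiplier update are assumptions made here for illustration only.

```python
from typing import List, Sequence


def cost_cdf_at(probs: Sequence[float], atoms: Sequence[float], threshold: float) -> float:
    """Estimated P(C <= threshold) for a categorical (C51-style) cost distribution."""
    return sum(p for p, z in zip(probs, atoms) if z <= threshold)


def admissible_actions(cost_probs: Sequence[Sequence[float]], atoms: Sequence[float],
                       threshold: float, delta: float) -> List[int]:
    """Actions whose estimated P(C <= threshold) is at least 1 - delta."""
    return [a for a, probs in enumerate(cost_probs)
            if cost_cdf_at(probs, atoms, threshold) >= 1.0 - delta]


def select_action(retriever_scores: Sequence[float], critic_q: Sequence[float],
                  cost_probs: Sequence[Sequence[float]], atoms: Sequence[float],
                  threshold: float, delta: float, k_prime: int, k: int) -> int:
    """Retriever -> screening -> Top-K shortlist -> critic refinement."""
    # 1) Retriever: keep the K' highest-scoring candidates.
    ranked = sorted(range(len(retriever_scores)),
                    key=lambda a: retriever_scores[a], reverse=True)
    candidates = ranked[:k_prime]
    # 2) Chance-constraint screening via CDF thresholding.
    safe = set(admissible_actions(cost_probs, atoms, threshold, delta))
    screened = [a for a in candidates if a in safe]
    # 3) Fallback (assumed form): if nothing passes, keep the least risky candidate.
    if not screened:
        screened = [max(candidates,
                        key=lambda a: cost_cdf_at(cost_probs[a], atoms, threshold))]
    # 4) Top-K shortlist (retriever order preserved), then pick the best critic value.
    shortlist = screened[:k]
    return max(shortlist, key=lambda a: critic_q[a])


def follower_best_response(q2: Sequence[float], c2: Sequence[float], lam2: float) -> int:
    """Follower action maximizing the penalized value Q2 - lam2 * C2."""
    return max(range(len(q2)), key=lambda a: q2[a] - lam2 * c2[a])


def dual_update(lam2: float, observed_cost: float, budget: float, lr: float) -> float:
    """Projected dual ascent (assumed form): raise lam2 when cost exceeds the budget."""
    return max(0.0, lam2 + lr * (observed_cost - budget))


if __name__ == "__main__":
    atoms = [0.0, 0.5, 1.0]                      # hypothetical cost support
    cost_probs = [[0.9, 0.1, 0.0],               # one categorical distribution per action
                  [0.2, 0.3, 0.5],
                  [0.6, 0.3, 0.1]]
    action = select_action([0.1, 0.9, 0.5], [1.0, 5.0, 2.0], cost_probs, atoms,
                           threshold=0.5, delta=0.15, k_prime=3, k=2)
    print("selected action:", action)
```

In this toy setup the action with the highest critic value is also the riskiest one, so screening removes it before refinement and the critic chooses among the remaining candidates. A larger multiplier in `follower_best_response` shifts the follower toward lower-cost actions in the same way.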


Cite This Study

Qu et al. studied this question.

synapsesocial.com/papers/69a75f3ec6e9836116a2a7a7 https://doi.org/10.1109/access.2026.3659797