What question did this study set out to answer?

The aim is to improve exploration for reinforcement learning agents while adhering to safety constraints during navigation tasks.

March 2, 2026 · Open Access

Enhancing Safe Exploration Through Subgoal Guidance


Key Points

The aim is to improve exploration for reinforcement learning agents while adhering to safety constraints during navigation tasks.
Introduced SG-Safe, a learning-based method for autonomous navigation.
Decomposed complex tasks into smaller subgoals to enhance exploration.
Employed two policies: a subgoal policy for generating intermediate goals and a safe policy for guided exploration.
Evaluated effectiveness using the Safety Gym benchmark and autonomous vehicle navigation simulations.
Achieved an average success rate of 0.87 with a collision rate of 0.03 on Safety Gym.
On POLAMP, SG-Safe attained a success rate of 0.90 with a collision rate of 0.03.
Outperformed the best single-policy safe RL baseline and the planning-based Lyapunov-RRT baseline on both success rate and collision rate.
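
The success and collision rates quoted above are per-episode averages. As a minimal sketch of how such figures could be computed, assuming that the success rate is the fraction of episodes that reach the goal and the collision rate is the fraction of episodes with at least one collision (the page does not give the paper's exact definitions), with the `reached_goal` and `collided` field names being hypothetical:

```python
def summarize(episodes):
    """Return (success_rate, collision_rate) over a list of episode records.

    Each record is a dict with two hypothetical boolean fields:
    'reached_goal' and 'collided'. These definitions are assumptions,
    not the paper's stated evaluation protocol.
    """
    n = len(episodes)
    if n == 0:
        raise ValueError("need at least one episode")
    success_rate = sum(1 for e in episodes if e["reached_goal"]) / n
    collision_rate = sum(1 for e in episodes if e["collided"]) / n
    return success_rate, collision_rate


# Made-up illustrative records, not data from the study.
example_episodes = [
    {"reached_goal": True, "collided": False},
    {"reached_goal": True, "collided": False},
    {"reached_goal": True, "collided": False},
    {"reached_goal": False, "collided": True},
]
success_rate, collision_rate = summarize(example_episodes)
```

Under these assumed definitions the two rates need not sum to 1, since an episode can time out without colliding.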

Abstract

Reinforcement learning is a widely used approach for autonomous navigation, but it often struggles to reach distant, long-horizon goals under safety constraints. The primary reason for this suboptimal performance is that safety requirements significantly degrade exploration capabilities during training, limiting the agent’s ability to discover feasible long-horizon policies. To address this issue, we introduce SG-Safe, a novel learning-based method that decomposes complex navigation tasks into smaller sub-problems using intermediate goals while respecting cumulative safety constraints. Our approach employs two coupled policies: a subgoal policy that generates intermediate subgoals, and a safe policy that leverages these subgoals to guide exploration toward the final objective. This hierarchical structure improves exploration without compromising safety and remains effective even under partial observability. We evaluate our method in simulated autonomous vehicle navigation and on the Safety Gym benchmark. The experimental results demonstrate a strong safety–performance trade-off under partial observability. On Safety Gym, SG-Safe achieves an average success rate (SR) of 0.87 with a collision rate (CR) of 0.03, improving over the best end-to-end learning-based safe RL baseline (SR 0.40, CR 0.05). On POLAMP, SG-Safe attains SR 0.90 with CR 0.03, while the best single-policy safe RL baseline reaches at most SR 0.36 under the same observation settings. Moreover, on POLAMP, SG-Safe also surpasses the planning-based Lyapunov-RRT baseline that has access to full environment information, achieving higher success (SR 0.90 vs. 0.34) and a lower collision rate (CR 0.03 vs. 0.51), while eliminating online planning at deployment (PT = 0 s vs. ≈10–30 s per episode).
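
To make the two-policy structure concrete, the following is a minimal control-flow sketch and not the authors' implementation. Everything in it is an assumption for illustration: the 2D point world, the hand-written `subgoal_policy` and `safe_policy` stand-ins (in SG-Safe both are learned), the circular hazards, and the hard stop once the cumulative cost exceeds a budget. The placeholder low-level policy does not avoid hazards; it only shows where a subgoal-conditioned policy sits in the loop.

```python
import math


def subgoal_policy(position, goal, step=1.0):
    """Hypothetical stand-in: propose a waypoint at most `step` closer to the goal."""
    dx, dy = goal[0] - position[0], goal[1] - position[1]
    dist = math.hypot(dx, dy)
    if dist <= step:
        return goal
    return (position[0] + step * dx / dist, position[1] + step * dy / dist)


def safe_policy(position, subgoal, max_move=0.25):
    """Hypothetical stand-in: bounded move toward the current subgoal."""
    dx, dy = subgoal[0] - position[0], subgoal[1] - position[1]
    dist = math.hypot(dx, dy)
    if dist <= max_move:
        return (dx, dy)
    return (max_move * dx / dist, max_move * dy / dist)


def rollout(start, goal, hazards, hazard_radius=0.3, cost_budget=1.0,
            max_steps=200, tol=1e-6):
    """Run one episode: the high level picks subgoals, the low level follows them."""
    position = start
    cumulative_cost = 0.0
    subgoal = subgoal_policy(position, goal)
    for step in range(1, max_steps + 1):
        action = safe_policy(position, subgoal)
        position = (position[0] + action[0], position[1] + action[1])
        # Assumed cost signal: 1 for every step spent inside a hazard circle.
        if any(math.dist(position, h) < hazard_radius for h in hazards):
            cumulative_cost += 1.0
        if cumulative_cost > cost_budget:
            return {"reached": False, "cost": cumulative_cost, "steps": step}
        if math.dist(position, goal) <= tol:
            return {"reached": True, "cost": cumulative_cost, "steps": step}
        if math.dist(position, subgoal) <= tol:
            # Subgoal achieved: ask the high-level policy for the next one.
            subgoal = subgoal_policy(position, goal)
    return {"reached": False, "cost": cumulative_cost, "steps": max_steps}


free_run = rollout(start=(0.0, 0.0), goal=(3.0, 0.0), hazards=[])
blocked_run = rollout(start=(0.0, 0.0), goal=(3.0, 0.0), hazards=[(1.5, 0.0)])
```

In this toy setting `free_run` reaches the goal, while `blocked_run` ends early because the straight-line placeholder drives through the hazard and exhausts the cost budget. A learned safe policy would instead have to trade progress toward the subgoal against accumulated cost.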


Authors

Gregory Gorbov

Independent University of Moscow

Aleksandr Panov

Cognitive Research (United States)
