What question did this study set out to answer?

The study aims to improve autonomous driving decision-making by fusing a depth-integrated semantic scene graph with a vision-language model.

April 16, 2026

SGVLM: Depth‐Integrated Semantic Scene Graph Fusion for Enhanced Autonomous Driving Decision‐Making

Key Points

The aim is to enhance decision-making capabilities in autonomous driving using a semantic graph model.
Developed the SGVLM architecture combining semantic graphs and vision-language models.
Utilized Depth-Anything-V2 for accurate inter-object distance estimation.
Implemented a two-layer Graph Attention Network for feature aggregation.
Applied Low-Rank Adaptation to improve computational efficiency.
Validated on the DriveLM-nuScenes benchmark and on safety-critical scenarios from the authors' TTSG-data.
Achieved a 25.9% relative improvement in BLEU-4 over the InternVL4Drive-v2 baseline on DriveLM-nuScenes.
Achieved an 18.6% relative improvement in ROUGE-L over the same InternVL4Drive-v2 baseline.
Attained 94.56% accuracy on collision-warning decision tasks in the TTSG-data safety-critical scenarios.
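
To make the scene-graph idea in the key points concrete, the following is a minimal sketch of how objects could become nodes and depth-derived distances could become edge attributes. It is not the authors' implementation. It assumes a metric depth value per object is already available (for example, from a depth estimator such as Depth-Anything-V2) and a simple pinhole camera. The class names, the relation labels, and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
import math


@dataclass
class Node:
    """One key object in the scene (fields are illustrative)."""
    name: str
    category: str      # e.g. "car", "pedestrian"
    state: str         # e.g. "moving", "stopped"
    pixel: tuple       # (u, v) centre of the object's box, in pixels
    depth_m: float     # assumed metric depth at that pixel, in metres


@dataclass
class Edge:
    """Directed relation between two objects."""
    source: str
    target: str
    relation: str      # coarse spatial relation (hypothetical labels)
    distance_m: float  # depth-derived inter-object distance, in metres


def back_project(pixel, depth_m, fx, fy, cx, cy):
    """Pinhole back-projection of a pixel with known depth to camera coordinates."""
    u, v = pixel
    return ((u - cx) * depth_m / fx, (v - cy) * depth_m / fy, depth_m)


def build_edges(nodes, fx, fy, cx, cy):
    """Connect every ordered pair of nodes with a relation and a 3D distance."""
    points = {n.name: back_project(n.pixel, n.depth_m, fx, fy, cx, cy) for n in nodes}
    edges = []
    for a in nodes:
        for b in nodes:
            if a.name == b.name:
                continue
            pa, pb = points[a.name], points[b.name]
            # Relation is decided from depth alone in this sketch.
            relation = "nearer_than" if pa[2] < pb[2] else "farther_than"
            edges.append(Edge(a.name, b.name, relation, math.dist(pa, pb)))
    return edges
```

Calling `build_edges` on a list of `Node` objects returns one `Edge` per ordered pair, so each object carries both its own attributes and its distance to every other object. A real system would likely prune edges and use richer relation labels.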

Abstract

Autonomous driving decision‐making requires a deep semantic understanding of traffic scenes. In this paper, we propose the SGVLM (Semantic Graph Vision‐Language Model) architecture: a vision‐language model that enhances autonomous driving decision‐making through depth‐integrated semantic scene graph fusion. Key objects are represented as nodes (category, state) and spatial‐semantic relations as edges, enriched with pixel‐wise depth estimates from Depth‐Anything‐V2 to capture accurate inter‐object distances. These structured graph features are aggregated via a two‐layer Graph Attention Network and projected into FastVLM's FastViTHD feature space. A cross‐modal triplet fusion layer then jointly integrates graph embeddings, visual features, and natural‐language queries. Crucially, to ensure computational efficiency without compromising the generalization power of the large‐scale backbone, we employ Low‐Rank Adaptation (LoRA), which significantly reduces the number of trainable parameters and accelerates convergence while maintaining pre‐trained performance. Empirical validation on the DriveLM‐nuScenes benchmark demonstrates that SGVLM₇B achieves relative improvements of 25.9% in BLEU‐4 and 18.6% in ROUGE‐L over the InternVL4Drive‐v2 baseline, and attains 94.56% accuracy on collision‐warning decision tasks in our TTSG‐data safety‐critical scenarios. These results confirm that depth‐integrated semantic scene graph fusion substantially enhances the model's ability to generate actionable driving decisions under complex traffic conditions.
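
The graph aggregation step mentioned in the abstract can be illustrated with a small sketch of the standard single-head graph attention formulation, stacked twice. This is an assumption-laden illustration rather than the paper's code: it ignores edge features such as distance, uses one attention head, applies an ELU between the two layers, and all function and parameter names (`gat_layer`, `two_layer_gat`, `W`, `a`) are hypothetical.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def leaky_relu(z, slope=0.2):
    return z if z > 0 else slope * z


def elu(v):
    return [x if x > 0 else math.exp(x) - 1.0 for x in v]


def gat_layer(h, neighbors, W, a):
    """Single-head graph attention layer (no edge features in this sketch).

    h: dict mapping node -> feature vector
    neighbors: dict mapping node -> list of neighbour nodes (a self-loop is added)
    W: weight matrix of shape (out_dim x in_dim)
    a: attention vector of length 2 * out_dim
    """
    z = {i: matvec(W, x) for i, x in h.items()}
    out = {}
    for i in h:
        nbrs = [i] + [j for j in neighbors.get(i, []) if j != i]
        # Unnormalised scores e_ij = LeakyReLU(a . [z_i || z_j])
        scores = [leaky_relu(sum(ak * v for ak, v in zip(a, z[i] + z[j]))) for j in nbrs]
        # Numerically stable softmax over the neighbourhood
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        # Attention-weighted sum of transformed neighbour features
        out[i] = [sum(al * z[j][d] for al, j in zip(alphas, nbrs)) for d in range(len(z[i]))]
    return out


def two_layer_gat(h, neighbors, params):
    """Stack two attention layers with an ELU in between."""
    (W1, a1), (W2, a2) = params
    hidden = {i: elu(v) for i, v in gat_layer(h, neighbors, W1, a1).items()}
    return gat_layer(hidden, neighbors, W2, a2)
```

With an all-zero attention vector the weights become uniform, so each node's output reduces to the mean of its own and its neighbours' transformed features. Learned attention vectors let the layer weight some neighbours, such as a nearby pedestrian, more heavily than others.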
