ABSTRACT Autonomous driving decision‐making requires a deep semantic understanding of traffic scenes. In this paper, we propose SGVLM (Semantic Graph Vision‐Language Model), a vision‐language model that enhances autonomous driving decision‐making through depth‐integrated semantic scene graph fusion. Key objects are represented as nodes (category, state) and spatial‐semantic relations as edges, enriched with pixel‐wise depth estimates from Depth‐Anything‐V2 to capture accurate inter‐object distances. These structured graph features are aggregated by a two‐layer Graph Attention Network and projected into the FastViTHD feature space of FastVLM. A cross‐modal triplet fusion layer then jointly integrates graph embeddings, visual features, and natural‐language queries. To keep training computationally efficient without compromising the generalization power of the large‐scale backbone, we employ Low‐Rank Adaptation (LoRA), which significantly reduces the number of trainable parameters and accelerates convergence while maintaining pre‐trained performance. Empirical validation on the DriveLM‐nuScenes benchmark shows that SGVLM₇B achieves relative improvements of 25.9% in BLEU‐4 and 18.6% in ROUGE‐L over the InternVL4Drive‐v2 baseline, and attains 94.56% accuracy on collision‐warning decision tasks in the safety‐critical scenarios of our TTSG‐data. These results confirm that depth‐integrated semantic scene graph fusion substantially enhances the model's ability to generate actionable driving decisions under complex traffic conditions.
Han et al. (Wed,) studied this question.
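
The depth‐enriched scene graph described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Node` and `Edge` classes, the `backproject` and `make_edge` helpers, the pinhole camera intrinsics, and the assumption that each object has a single metric depth value at its pixel centre are all hypothetical choices made for illustration. The sketch shows only how a node (category, state) and a relation edge might carry an inter‐object distance derived from depth.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Node:
    """A key object in the scene (hypothetical structure)."""
    name: str
    category: str    # e.g. "car", "pedestrian"
    state: str       # e.g. "moving", "stopped"
    u: float         # pixel column of the object centre (assumed given by a detector)
    v: float         # pixel row of the object centre
    depth_m: float   # depth at the centre, assumed to be metric (metres)


@dataclass
class Edge:
    """A spatial-semantic relation between two nodes, with a distance attribute."""
    src: str
    dst: str
    relation: str
    distance_m: float


def backproject(node, fx, fy, cx, cy):
    """Pinhole back-projection of (u, v, depth) to camera-frame (x, y, z)."""
    x = (node.u - cx) * node.depth_m / fx
    y = (node.v - cy) * node.depth_m / fy
    return (x, y, node.depth_m)


def make_edge(a, b, relation, intrinsics):
    """Build an edge whose distance is the Euclidean gap between the 3D points."""
    pa = backproject(a, *intrinsics)
    pb = backproject(b, *intrinsics)
    dist = sqrt(sum((p - q) ** 2 for p, q in zip(pa, pb)))
    return Edge(a.name, b.name, relation, dist)


if __name__ == "__main__":
    # Illustrative intrinsics (fx, fy, cx, cy); not taken from any dataset.
    K = (1400.0, 1400.0, 800.0, 450.0)
    car = Node("car_1", "car", "moving", 800.0, 450.0, 10.0)
    ped = Node("ped_1", "pedestrian", "crossing", 1100.0, 450.0, 14.0)
    print(make_edge(car, ped, "behind", K))
```

In a full pipeline, such node and edge attributes would be encoded as feature vectors before being passed to the graph attention layers; that encoding is outside the scope of this sketch.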