What question did this study set out to answer?

The aim is to develop a framework to evaluate vision-language models in unstructured driving environments under challenging conditions.

June 13, 2026 | Open Access

Unstructured Scene Benchmark (USB): Which VLM Performs Better in Autonomous Driving?

Key Points

The aim is to develop a framework to evaluate vision-language models in unstructured driving environments under challenging conditions.
Developed a test framework that combines multiple-choice and subjective questions for VLM assessment.
Evaluated seven open-source VLMs under multiple input perturbations in unstructured scenes.
Implemented multi-dimensional scoring prompts to grade the reasoning and accuracy of free-text responses.
Scored multiple-choice questions by accuracy, highlighting performance gaps between models.
Analyzed the reasoning coherence and visual fidelity of responses, and how robustness varies across input perturbations.
Identified critical evaluation gaps in VLMs for autonomous driving applications.
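
One way to read the robustness comparison in the key points above is as the share of clean-input accuracy a model keeps under each perturbation. The sketch below is only an illustration under that assumption: the summary does not state the robustness metric used in the study, and the function names and perturbation labels (`fog`, `low_light`) are hypothetical.

```python
from typing import Dict


def retained_accuracy(clean_acc: float, perturbed_acc: Dict[str, float]) -> Dict[str, float]:
    """Return the fraction of clean-input accuracy kept under each perturbation.

    Assumption: robustness is measured as perturbed accuracy divided by
    clean accuracy. The study may define it differently.
    """
    if clean_acc <= 0:
        raise ValueError("clean accuracy must be positive")
    return {name: acc / clean_acc for name, acc in perturbed_acc.items()}


def worst_perturbation(retained: Dict[str, float]) -> str:
    """Name the perturbation under which the model keeps the least accuracy."""
    return min(retained, key=retained.get)


if __name__ == "__main__":
    # Hypothetical accuracies for a single model, for illustration only.
    retained = retained_accuracy(0.8, {"fog": 0.6, "low_light": 0.4})
    for name, ratio in retained.items():
        print(f"{name}: keeps {ratio:.0%} of clean accuracy")
    print("weakest under:", worst_perturbation(retained))
```

A ratio close to 1 would indicate that the perturbation barely affects the model, while a low ratio flags a condition that deserves closer inspection.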

Abstract

Vision-language models (VLMs) offer the potential for unified perception and language-guided decision-making in autonomous driving. However, existing benchmarks predominantly focus on structured road environments and high-quality imagery, leaving limited evidence on model performance (e.g., reasoning, explanation, and decision traceability) in unstructured scenes or under degraded sensing conditions. This study develops a trustworthy test and evaluation framework to systematically assess VLM performance in these challenging contexts. Impromptu vision-language-action (VLA) samples are reorganized into six synchronized camera views augmented with vehicle state information, and twenty realistic input perturbations, covering illumination, weather, sensor reliability, and occlusion, are introduced for each scene. Moreover, six original non-open-ended questions are reformulated into traffic decision templates that combine structured choice sets with template-constrained free-text responses. Two types of evaluation methods are formulated. Multiple-choice questions (MCQs) are scored by exact answer matching and aggregated with importance weights derived from expert rankings to prioritize planning tasks, while subjective questions (SQs) are graded using tailored, multi-dimensional LLM-based scoring prompts. Experiments assess seven open-source VLMs on MCQ accuracy, the reasoning coherence and visual fidelity of their SQ responses, and their robustness to input perturbations. Overall, the proposed framework addresses a critical evaluation gap in supporting the deployment of VLMs in autonomous driving applications.
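
The MCQ scoring described in the abstract (exact answer matching, then aggregation with weights from expert rankings) could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the linear rank-to-weight mapping, the task names, and the function names are all assumptions, since the abstract does not give the actual weighting formula.

```python
from typing import Dict, List, Tuple


def rank_weights(ranked_tasks: List[str]) -> Dict[str, float]:
    """Convert an expert ranking (most important first) into normalized weights.

    Assumption: weights fall off linearly with rank. The paper's actual
    weighting scheme is not stated in the abstract.
    """
    n = len(ranked_tasks)
    raw = {task: float(n - i) for i, task in enumerate(ranked_tasks)}
    total = sum(raw.values())
    return {task: w / total for task, w in raw.items()}


def weighted_mcq_accuracy(
    records: List[Tuple[str, str, str]], weights: Dict[str, float]
) -> float:
    """Aggregate per-task exact-match accuracy using importance weights.

    Each record is (task, predicted_option, gold_option). Matching is exact
    after trimming whitespace and ignoring letter case.
    """
    hits: Dict[str, int] = {}
    counts: Dict[str, int] = {}
    for task, predicted, gold in records:
        counts[task] = counts.get(task, 0) + 1
        if predicted.strip().upper() == gold.strip().upper():
            hits[task] = hits.get(task, 0) + 1

    score = 0.0
    weight_sum = 0.0
    for task, count in counts.items():
        score += weights[task] * hits.get(task, 0) / count
        weight_sum += weights[task]
    # Renormalize so tasks missing from the records do not deflate the score.
    return score / weight_sum if weight_sum > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical task names, ordered so that planning carries most weight.
    weights = rank_weights(["planning", "prediction", "perception"])
    records = [
        ("planning", "A", "A"),
        ("planning", "B", "C"),
        ("prediction", "d", "D"),
        ("perception", "A", "B"),
    ]
    print(f"weighted MCQ accuracy: {weighted_mcq_accuracy(records, weights):.3f}")
```

Weighting per-task accuracy, rather than pooling all questions, keeps a large number of easy perception items from masking errors on the planning questions that the ranking is meant to prioritize.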

Cite This Study

Zheng et al. studied this question.

synapsesocial.com/papers/6a2cf604faef96ed7f057ecd https://doi.org/10.26599/commtr.2026.9640034
