Abstract Vision-language models (VLMs) offer the potential for unified perception and language-guided decision-making in autonomous driving. However, existing benchmarks predominantly focus on structured road environments and high-quality imagery, leaving limited evidence on model performance (e.g., reasoning, explanation, and decision traceability) in unstructured scenes or under degraded sensing conditions. This study develops a trustworthy test and evaluation framework to systematically assess VLM performance in these challenging contexts. Impromptu vision-language-action (VLA) samples are reorganized into six synchronized camera views augmented with vehicle state information, and twenty realistic input perturbations, covering illumination, weather, sensor reliability, and occlusion, are introduced for each scene. Moreover, six original non-open-ended questions are reformulated into traffic decision templates that combine structured choice sets with template-constrained free-text responses. Two types of evaluation methods are formulated. Multiple-choice questions (MCQs) are scored by exact answer matching and aggregated with importance weights derived from expert rankings to prioritize planning tasks, while subjective questions (SQs) are graded using tailored, multi-dimensional LLM-based scoring prompts. Experiments assess seven open-source VLMs on MCQ accuracy, on the reasoning coherence and visual fidelity of their responses to SQs, and on their robustness to input perturbations. Overall, the proposed framework addresses a critical evaluation gap and supports the deployment of VLMs in autonomous driving applications.
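The MCQ scoring described above (exact answer matching, aggregated with importance weights from expert rankings) can be illustrated with a minimal sketch. The abstract does not state how rankings are converted into weights, so the linear rank-to-weight mapping below is an assumption, and the function names and question-type labels (`rank_weights`, `weighted_mcq_score`, `planning`, `prediction`, `perception`) are hypothetical rather than taken from the study.

```python
def rank_weights(ranks):
    """Turn expert ranks (1 = most important) into weights that sum to 1.

    Assumption: a linear mapping, weight proportional to (n - rank + 1).
    The study may use a different scheme.
    """
    n = len(ranks)
    raw = {qtype: n - rank + 1 for qtype, rank in ranks.items()}
    total = sum(raw.values())
    return {qtype: value / total for qtype, value in raw.items()}


def weighted_mcq_score(predictions, answers, weights):
    """Exact-match accuracy per question type, combined with importance weights."""
    score = 0.0
    for qtype, weight in weights.items():
        preds, golds = predictions[qtype], answers[qtype]
        # Exact matching on the chosen option letter, ignoring case and whitespace.
        correct = sum(
            p.strip().upper() == g.strip().upper() for p, g in zip(preds, golds)
        )
        score += weight * correct / len(golds)
    return score


# Hypothetical question types and answers, for illustration only.
ranks = {"planning": 1, "prediction": 2, "perception": 3}
weights = rank_weights(ranks)
predictions = {"planning": ["A", "B"], "prediction": ["B"], "perception": ["c"]}
answers = {"planning": ["A", "C"], "prediction": ["B"], "perception": ["C"]}
score = weighted_mcq_score(predictions, answers, weights)
```

With these toy inputs, the planning questions carry half of the total weight, so an error there lowers the aggregate score more than an error on a perception question would.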
Zheng et al. (Mon,) studied this question.