June 17, 2026 | Open Access

CLIP-BEV: A Late-Fusion Framework for Multimodal Scene Understanding Using Vision Language Models

Key Points

Aimed to improve scene understanding in autonomous driving by integrating semantic and geometric information from heterogeneous sensors.
Proposed a late-fusion framework combining CLIP semantic embeddings and LiDAR BEV features.
Employed modality-specific adaptation networks to refine features before fusion.
Evaluated on the nuScenes dataset with synchronized camera–LiDAR samples.
Achieved Hamming Accuracy of 0.950, Micro-F1 score of 0.925, and mean Average Precision (mAP) of 0.908.
Outperformed both unimodal and early-fusion baselines.
Attributed the gains primarily to modality-specific refinement and the late-fusion strategy rather than to the visual encoder alone.
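
For readers unfamiliar with the three reported metrics, the following is a minimal sketch of how they are commonly computed for multi-label classification. It is not the authors' evaluation code: the function names, the toy data, and the 0.5 decision threshold are illustrative assumptions, and the handling of tied scores and of labels without positives may differ from the library the authors used.

```python
def hamming_accuracy(y_true, y_pred):
    """Fraction of individual label decisions that are correct."""
    total = sum(len(row) for row in y_true)
    correct = sum(
        t == p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return correct / total


def micro_f1(y_true, y_pred):
    """F1 computed from counts pooled over all samples and labels."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def average_precision(labels, scores):
    """Mean of precision values at the rank of each positive sample."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precision_sum += hits / rank
    # Assumption: a label with no positive samples contributes 0.
    return precision_sum / hits if hits else 0.0


def mean_average_precision(y_true, y_score):
    """Average precision per label, averaged over labels."""
    n_labels = len(y_true[0])
    per_label = [
        average_precision([row[j] for row in y_true],
                          [row[j] for row in y_score])
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels


# Toy data (hypothetical): 3 samples, 3 labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_score = [[0.9, 0.2, 0.4], [0.1, 0.8, 0.3], [0.7, 0.6, 0.2]]
# Assumed threshold of 0.5 to turn scores into hard predictions.
y_pred = [[int(s >= 0.5) for s in row] for row in y_score]

print(hamming_accuracy(y_true, y_pred))
print(micro_f1(y_true, y_pred))
print(mean_average_precision(y_true, y_score))
```

Hamming Accuracy and Micro-F1 depend on the thresholded predictions, whereas mAP depends only on the ranking of the scores, which is why the three numbers can move independently.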

Abstract

Scene understanding is a fundamental task in autonomous driving, requiring effective integration of semantic and geometric information from heterogeneous sensors. Although vision–language models (VLMs) provide powerful semantic representations, their integration with LiDAR-based geometric perception remains challenging. This paper proposes a multimodal late-fusion framework for multi-label scene classification that combines semantic embeddings extracted from camera images using a frozen CLIP (ViT-B/32) encoder with geometric features derived from LiDAR Bird’s-Eye-View (BEV) representations. To improve multimodal compatibility, modality-specific adaptation networks are employed to refine visual and geometric features before fusion. The proposed framework was evaluated on an annotated subset of the nuScenes dataset containing synchronized camera–LiDAR samples and nine scene-level labels. Experimental results show that the proposed late-fusion architecture outperforms both unimodal and early-fusion baselines, achieving a Hamming Accuracy of 0.950, a Micro-F1 score of 0.925, and a mean Average Precision (mAP) of 0.908. Additional experiments using a CLIP-based early-fusion baseline demonstrate that the observed performance gains are primarily attributable to the proposed modality-specific refinement and late-fusion strategy rather than the visual encoder alone. These findings indicate that modality-aware late fusion of pretrained semantic representations and LiDAR geometric information provides an effective and scalable solution for multimodal perception in autonomous driving.
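
The data flow described in the abstract can be illustrated with a small sketch: each modality passes through its own adaptation network, the refined features are fused, and a multi-label head produces one probability per scene label. This is a sketch under stated assumptions, not the authors' implementation. The 512-dimensional input matches the CLIP ViT-B/32 image embedding and the nine outputs match the nine scene-level labels, but the BEV feature size, the adapter width, the single-layer adapters, fusion by concatenation, and the randomly initialised weights are all hypothetical. The frozen CLIP encoder and the LiDAR BEV encoder are not shown; their outputs are replaced by placeholder vectors.

```python
import math
import random

NUM_LABELS = 9   # nine scene-level labels (from the abstract)
CLIP_DIM = 512   # CLIP ViT-B/32 image embedding size
BEV_DIM = 256    # assumption: size of the pooled LiDAR BEV feature
HIDDEN = 64      # assumption: width of each adaptation network


def make_linear(rng, n_in, n_out):
    """Randomly initialised weights; stand-in for trained parameters."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def linear(layer, x):
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def sigmoid(x):
    return [1.0 / (1.0 + math.exp(-v)) for v in x]


class LateFusionClassifier:
    """Hypothetical late-fusion model with one adapter per modality."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.clip_adapter = make_linear(rng, CLIP_DIM, HIDDEN)
        self.bev_adapter = make_linear(rng, BEV_DIM, HIDDEN)
        self.head = make_linear(rng, 2 * HIDDEN, NUM_LABELS)

    def forward(self, clip_embedding, bev_feature):
        # Modality-specific refinement happens before any mixing.
        semantic = relu(linear(self.clip_adapter, clip_embedding))
        geometric = relu(linear(self.bev_adapter, bev_feature))
        # Assumed fusion operator: concatenation of the refined features.
        fused = [*semantic, *geometric]
        # Independent sigmoid per label, as is usual for multi-label output.
        return sigmoid(linear(self.head, fused))


# Placeholder inputs standing in for the two encoder outputs.
model = LateFusionClassifier(seed=0)
clip_embedding = [0.1] * CLIP_DIM
bev_feature = [0.2] * BEV_DIM
probabilities = model.forward(clip_embedding, bev_feature)
print(len(probabilities))
```

In an early-fusion variant the raw features would be combined before any modality-specific processing; keeping the adapters separate until the final step is what makes this a late-fusion layout.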
