Scene understanding is a fundamental task in autonomous driving, requiring effective integration of semantic and geometric information from heterogeneous sensors. Although vision–language models (VLMs) provide powerful semantic representations, their integration with LiDAR-based geometric perception remains challenging. This paper proposes a multimodal late-fusion framework for multi-label scene classification that combines semantic embeddings extracted from camera images using a frozen CLIP (ViT-B/32) encoder with geometric features derived from LiDAR Bird’s-Eye-View (BEV) representations. To improve multimodal compatibility, modality-specific adaptation networks are employed to refine visual and geometric features before fusion. The proposed framework was evaluated on an annotated subset of the nuScenes dataset containing synchronized camera–LiDAR samples and nine scene-level labels. Experimental results show that the proposed late-fusion architecture outperforms both unimodal and early-fusion baselines, achieving a Hamming Accuracy of 0.950, a Micro-F1 score of 0.925, and a mean Average Precision (mAP) of 0.908. Additional experiments using a CLIP-based early-fusion baseline demonstrate that the observed performance gains are primarily attributable to the proposed modality-specific refinement and late-fusion strategy rather than the visual encoder alone. These findings indicate that modality-aware late fusion of pretrained semantic representations and LiDAR geometric information provides an effective and scalable solution for multimodal perception in autonomous driving.
Daraee et al. studied this question.
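To make the late-fusion idea in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that each modality is refined by a single linear-plus-ReLU adapter, that fusion is a concatenation of the two refined feature vectors, and that a sigmoid head produces the nine independent label probabilities. The abstract fixes only the nine labels and the frozen CLIP ViT-B/32 encoder (whose image embeddings are 512-dimensional); `BEV_DIM`, `HIDDEN`, the random weights, and all function names are hypothetical. Random arrays stand in for the precomputed CLIP embeddings and LiDAR BEV features, and only two of the three reported metrics (Hamming accuracy and Micro-F1) are sketched.

```python
import numpy as np

rng = np.random.default_rng(0)

CLIP_DIM = 512    # embedding size of CLIP ViT-B/32 image features
BEV_DIM = 256     # hypothetical size of the LiDAR BEV feature vector
HIDDEN = 128      # hypothetical adapter width
NUM_LABELS = 9    # nine scene-level labels, as stated in the abstract


def init_linear(in_dim, out_dim):
    """Return (weights, bias) for a linear layer with small random weights."""
    return rng.normal(0.0, 0.02, (in_dim, out_dim)), np.zeros(out_dim)


def adapter(x, params):
    """Modality-specific refinement, assumed here to be linear + ReLU."""
    weights, bias = params
    return np.maximum(x @ weights + bias, 0.0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Untrained stand-in parameters; a real model would learn these.
vis_adapter = init_linear(CLIP_DIM, HIDDEN)
geo_adapter = init_linear(BEV_DIM, HIDDEN)
head = init_linear(2 * HIDDEN, NUM_LABELS)


def late_fusion_forward(clip_emb, bev_feat):
    """Refine each modality separately, then fuse by concatenation (assumed)."""
    v = adapter(clip_emb, vis_adapter)
    g = adapter(bev_feat, geo_adapter)
    fused = np.concatenate([v, g], axis=-1)
    weights, bias = head
    # One independent sigmoid per label: multi-label, not softmax.
    return sigmoid(fused @ weights + bias)


def hamming_accuracy(y_true, y_pred):
    """Fraction of individual label decisions that are correct."""
    return float(np.mean(y_true == y_pred))


def micro_f1(y_true, y_pred):
    """F1 computed from counts pooled over all samples and labels."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


# Random stand-ins for a batch of four synchronized camera-LiDAR samples.
clip_emb = rng.normal(size=(4, CLIP_DIM))
bev_feat = rng.normal(size=(4, BEV_DIM))
probs = late_fusion_forward(clip_emb, bev_feat)
preds = (probs >= 0.5).astype(int)   # threshold each label independently
```

Because the CLIP encoder is frozen, only the adapters and the classification head would be trained in a setup like this, which keeps the number of trainable parameters small. An early-fusion baseline would instead combine the raw modality features before any modality-specific refinement.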