CLIP-BEV: A Late-Fusion Framework for Multimodal Scene Understanding Using Vision Language Models | Synapse