Real-time autonomous navigation in rapidly changing environments places growing demands on fast, accurate perception and decision-making across heterogeneous sensors. We propose FusedVisionNet, a transformer-based, real-time navigation system that fuses vision, depth, and LiDAR data. Unlike traditional convolutional single-modality or multi-modality systems, FusedVisionNet employs a cross-attention transformer backbone that combines spatial and semantic information extracted from several modalities, enabling better understanding of complex scenes. The model also features a multi-scale fusion framework that captures features shared across modalities along with characteristics unique to each modality, yielding more coherent representations. Evaluation on benchmark datasets such as KITTI and nuScenes shows that FusedVisionNet surpasses state-of-the-art methods in object detection, path planning, and obstacle avoidance while maintaining the low latency required for real-time use. Ablation studies demonstrate the contribution of each modality and of the fusion approach. By improving reliability under challenging weather and lighting conditions, FusedVisionNet achieves superior performance in diverse urban and off-road scenarios. The model marks a significant advance toward robust and dependable autonomous navigation systems designed for real-world use, and it highlights the promise of transformer-based multi-modal hierarchical fusion architectures for future autonomous vehicle technologies.
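To make the cross-attention fusion idea concrete, the following is a minimal sketch and not the FusedVisionNet implementation. It assumes that tokens from one modality (here called camera tokens) act as queries over tokens from another modality (here called LiDAR tokens), with no learned projections, a single head, and a simple residual sum. The function names `cross_attention` and `fuse` and the toy token values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention (assumption: no learned projections).

    queries: tokens from one modality, shape [n_q][d]
    keys, values: tokens from another modality, shapes [n_k][d] and [n_k][d_v]
    Returns one attended vector of length d_v per query token.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    attended = []
    for q in queries:
        # Similarity of this query token to every key token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value tokens.
        attended.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return attended


def fuse(camera_tokens, lidar_tokens):
    """Hypothetical residual fusion: camera token plus its LiDAR context."""
    context = cross_attention(camera_tokens, lidar_tokens, lidar_tokens)
    return [
        [c + x for c, x in zip(cam, ctx)]
        for cam, ctx in zip(camera_tokens, context)
    ]


# Toy inputs (illustrative values only): 2 camera tokens, 3 LiDAR tokens, d = 2.
camera_tokens = [[1.0, 0.0], [0.0, 1.0]]
lidar_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = fuse(camera_tokens, lidar_tokens)
```

Each fused token keeps its original camera feature and adds a LiDAR context vector weighted by similarity. A full model would typically add learned query, key, and value projections, multiple heads, and a symmetric pass in which LiDAR tokens attend to camera tokens.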
Van et al. (Mon,) studied this question.