In recent years, demand for both high accuracy and real-time performance in 3D object detection has grown alongside advances in autonomous driving technology. Multimodal methods that integrate LiDAR and camera data achieve high accuracy, but they often incur high computational cost and latency. To address these issues, we propose an efficient 3D object detection network that integrates three key components: a DepthWise Lightweight Encoder (DWLE) module for efficient feature extraction, an Efficient LiDAR Image Fusion (ELIF) module that combines channel attention with cross-modal feature interaction, and a Mixture of CNN and Point Transformer (MCPT) module for capturing rich spatial contextual information. Experimental results on the KITTI dataset show that the proposed method outperforms existing approaches, achieving approximately 0.6% higher 3D mAP, 7.6% faster inference, and 17.0% fewer parameters. These results highlight the effectiveness of our approach in balancing accuracy, speed, and model size, making it a promising solution for real-time autonomous driving applications.
Sakai et al. studied this question.
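
The abstract does not specify the internals of the DWLE module. As a rough illustration of why depthwise designs can reduce model size, the sketch below compares the weight count of a standard convolution with that of a depthwise separable convolution (a depthwise k x k filter followed by a 1 x 1 pointwise filter). The assumption that DWLE follows this pattern, the function names, and the channel sizes are all hypothetical and are not taken from the paper.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weight count of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weight count of a depthwise k x k conv plus a 1 x 1 pointwise conv (bias omitted)."""
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
C_IN, C_OUT, K = 64, 128, 3

standard = conv_params(C_IN, C_OUT, K)                   # 64 * 128 * 9 = 73728
separable = depthwise_separable_params(C_IN, C_OUT, K)   # 576 + 8192 = 8768
ratio = separable / standard                             # equals 1/C_OUT + 1/K**2

print(f"standard:  {standard} weights")
print(f"separable: {separable} weights")
print(f"ratio:     {ratio:.3f}")
```

For a single layer the ratio is 1/C_out + 1/k^2, so the saving grows with the kernel size and the number of output channels. The 17.0% reduction reported in the abstract is a whole-network figure and cannot be derived from this per-layer comparison.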