May 6, 2026 | Open Access

Unsupervised Domain Adaptation with Multimodal Fusion for Monocular 3D Object Detection

Key Points

Aims to develop an effective end-to-end unsupervised domain adaptation (UDA) framework for monocular 3D object detection.
Introduces the UM3D framework, built around quality-aware pseudo-label generation.
Fuses image features with Pseudo-LiDAR features generated from the same image.
Uses a multi-network consistency loss to jointly optimize depth estimation and 3D detection.
Achieves a 19.30% relative AP_BEV improvement under easy conditions with end-to-end joint training, compared to independent depth estimation.
Closed up to 76.81% of the domain gap on the WOD → KITTI benchmark.
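
The two headline figures above are ratios of detection scores, and this summary does not spell out how they are computed. The following is a minimal sketch, assuming the conventions commonly used in the UDA detection literature: relative improvement is the gain divided by the baseline score, and closed gap is the gain over a source-only model divided by the gap between that model and an oracle trained on labeled target data. The function names and the sample AP values are hypothetical; they are not results from the paper.

```python
def relative_improvement(baseline_ap: float, new_ap: float) -> float:
    """Gain of new_ap over baseline_ap, as a percentage of baseline_ap."""
    if baseline_ap <= 0:
        raise ValueError("baseline_ap must be positive")
    return (new_ap - baseline_ap) / baseline_ap * 100.0


def closed_gap(source_only_ap: float, adapted_ap: float, oracle_ap: float) -> float:
    """Share of the source-only-to-oracle gap recovered by the adapted model.

    Assumed definition (not confirmed by this summary):
        (adapted - source_only) / (oracle - source_only) * 100
    """
    gap = oracle_ap - source_only_ap
    if gap <= 0:
        raise ValueError("oracle_ap must exceed source_only_ap")
    return (adapted_ap - source_only_ap) / gap * 100.0


if __name__ == "__main__":
    # Hypothetical AP values, chosen only to make the arithmetic easy to follow.
    print(relative_improvement(baseline_ap=20.0, new_ap=25.0))
    print(closed_gap(source_only_ap=10.0, adapted_ap=40.0, oracle_ap=50.0))
```

Under these assumed definitions, a closed gap of 100% would mean the adapted model matches a detector trained with target-domain labels.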

Abstract

This paper presents UM3D, an end-to-end unsupervised domain adaptation framework for monocular 3D object detection. Monocular 3D object detection is appealing due to its low cost, yet it suffers from limited depth cues and poor cross-domain generalization when labeled data are scarce. Existing Pseudo-LiDAR methods require supervised training and propagate depth estimation errors to downstream detection, while current unsupervised domain adaptation (UDA) approaches exploit only a single modality and lack effective pseudo-label quality control. UM3D addresses these limitations through two key designs: (1) a quality-aware pseudo-label generation strategy with object-level random scaling and a memory bank refinement mechanism; and (2) an end-to-end differentiable pipeline that integrates multimodal fusion of image and Pseudo-LiDAR features with a multi-network consistency loss, which jointly optimizes depth estimation and 3D detection via backpropagation. Notably, the entire pipeline requires only a single monocular camera at inference: the Pseudo-LiDAR representation is generated internally from the same image, so the multimodal fusion needs no additional sensors. Extensive experiments across KITTI, nuScenes, Waymo, and Lyft demonstrate that UM3D generally outperforms existing UDA methods. In particular, end-to-end joint training achieves a 19.30% relative AP_BEV improvement under easy conditions compared to independent depth estimation, and up to 76.81% of the domain gap is closed on the WOD (Waymo Open Dataset) → KITTI benchmark.
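
To make concrete how a Pseudo-LiDAR representation can be generated from a single image, the sketch below back-projects a per-pixel depth map into a 3D point cloud with the standard pinhole camera model. It illustrates the general Pseudo-LiDAR idea only; it is not UM3D's implementation, and the function name, the `max_depth` cutoff, and the toy intrinsics are assumptions made for illustration.

```python
import numpy as np


def depth_to_pseudo_lidar(depth, fx, fy, cx, cy, max_depth=80.0):
    """Back-project an (H, W) depth map into an (N, 3) point cloud.

    Points are expressed in the camera frame (x right, y down, z forward).
    fx, fy, cx, cy are pinhole intrinsics in pixels. max_depth is a
    hypothetical cutoff that drops far, typically unreliable estimates.
    """
    h, w = depth.shape
    # u holds the column index and v the row index of every pixel.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    # Keep only points with a positive depth inside the cutoff.
    valid = (points[:, 2] > 0) & (points[:, 2] <= max_depth)
    return points[valid]


if __name__ == "__main__":
    # Toy 2x3 depth map at a constant 10 m, with made-up intrinsics.
    toy_depth = np.full((2, 3), 10.0)
    cloud = depth_to_pseudo_lidar(toy_depth, fx=1.0, fy=1.0, cx=1.0, cy=0.5)
    print(cloud.shape)
```

Every operation here is differentiable with respect to the depth values, which is the property that allows a detection loss to be backpropagated into a depth estimator in an end-to-end pipeline. Converting the points to a LiDAR coordinate frame would additionally require the camera extrinsics, which are omitted here.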
