This paper presents UM3D, an end-to-end unsupervised domain adaptation (UDA) framework for monocular 3D object detection. Monocular 3D object detection is appealing for its low cost, yet it suffers from limited depth cues and poor cross-domain generalization when labeled data are scarce. Existing Pseudo-LiDAR methods require supervised training and propagate depth estimation errors to downstream detection, while current UDA approaches exploit only a single modality and lack effective pseudo-label quality control. UM3D addresses these limitations through two key designs: (1) a quality-aware pseudo-label generation strategy with object-level random scaling and a memory bank refinement mechanism; and (2) an end-to-end differentiable pipeline that fuses image and Pseudo-LiDAR features under a multi-network consistency loss, jointly optimizing depth estimation and 3D detection via backpropagation. Notably, the entire pipeline requires only a single monocular camera at inference: the Pseudo-LiDAR representation is generated internally from the same image, so the multimodal fusion needs no additional sensors. Extensive experiments across KITTI, nuScenes, Waymo, and Lyft demonstrate that UM3D generally outperforms existing UDA methods. In particular, end-to-end joint training yields a 19.30% relative AP_BEV improvement under easy conditions compared to independent depth estimation, and up to 76.81% of the domain gap is closed on the Waymo Open Dataset (WOD) → KITTI benchmark.
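To make two ideas from the abstract concrete, the following is a minimal sketch and not the authors' implementation. It shows how a Pseudo-LiDAR point cloud can be obtained from a single image's predicted depth map under a standard pinhole camera model, and how a "closed domain gap" percentage is commonly computed in UDA work on 3D detection. The function names `depth_to_pseudo_lidar` and `closed_gap`, the intrinsics `fx, fy, cx, cy`, and the source-only/oracle formulation of the gap are assumptions introduced for illustration; the abstract does not specify them.

```python
import numpy as np


def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map into an (H*W, 3) camera-frame point cloud.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy); these parameter names are illustrative.
    """
    h, w = depth.shape
    # u indexes columns (x direction), v indexes rows (y direction).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)


def closed_gap(ap_model, ap_source_only, ap_oracle):
    """Fraction of the source-only to oracle AP gap recovered by the adapted model.

    Assumed definition, following common UDA practice; the paper may
    define the quantity differently.
    """
    return (ap_model - ap_source_only) / (ap_oracle - ap_source_only)


# Toy usage: a flat 2x3 depth map at 10 m, and hypothetical AP values.
points = depth_to_pseudo_lidar(np.full((2, 3), 10.0), fx=5.0, fy=5.0, cx=1.0, cy=0.5)
gap = closed_gap(ap_model=50.0, ap_source_only=20.0, ap_oracle=60.0)
```

Because the back-projection is a closed-form, differentiable function of the depth values, gradients from a detection loss on the points can in principle flow back into the depth estimator, which is the property that end-to-end joint training relies on.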
Jiang et al. studied this question.