4D millimeter-wave radar provides high-precision ranging capability and exhibits strong robustness under adverse weather and low-visibility conditions, but its point clouds are relatively sparse and suffer from severe elevation-angle measurement noise. Monocular cameras, by contrast, provide rich semantic information and high recall, yet are fundamentally limited by scale ambiguity. To exploit the complementary characteristics of these two sensors, this paper proposes a radar-camera fusion 3D multi-object tracking framework that does not rely on complex 3D annotated data. First, on the radar signal-processing side, a Gaussian distribution-based adaptive angle compression method and IMU-based velocity compensation are introduced to effectively suppress measurement noise, and an improved DBSCAN clustering scheme with recursive cluster splitting and historical static-box guidance is employed to generate high-quality radar detections. Second, a disparity-domain metric depth recovery method is proposed. This method uses filtered radar points as sparse metric anchors, performs robust fitting with RANSAC, and applies Kalman filtering for temporal smoothing, thereby converting the relative depth output of the visual foundation model Depth Anything V2 into metric depth. Finally, a hierarchical fusion strategy is designed at both the detection and tracking levels to achieve stable cross-modal state association. Experimental results on a self-collected dataset show that the proposed method achieves an overall MOTA of 77.93%, outperforming single-modality baselines and other comparison methods by 11 to 31 percentage points. This study provides an effective solution for low-cost and robust environment perception in complex dynamic scenarios.
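The disparity-domain depth recovery step described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes an affine model in the disparity (inverse-depth) domain, 1/z ≈ s · d_rel + t, where d_rel is the relative output of the depth network sampled at the projected radar points. The function names, the two-point RANSAC loop, the inlier tolerance, and the scalar Kalman filter with a constant-state model are all hypothetical choices made for illustration.

```python
import numpy as np


def fit_scale_shift_ransac(rel_disp, radar_depth, n_iters=200, inlier_tol=0.01, seed=0):
    """Fit 1/z ~= s * rel_disp + t from sparse radar anchors (assumed affine model).

    rel_disp:    relative disparity sampled at projected radar points
    radar_depth: metric range of the same radar points, in metres
    inlier_tol:  residual threshold in metric disparity (1/m), an assumed value
    """
    rng = np.random.default_rng(seed)
    rel_disp = np.asarray(rel_disp, dtype=float)
    target = 1.0 / np.asarray(radar_depth, dtype=float)  # metric disparity
    n = rel_disp.size
    if n < 2:
        raise ValueError("need at least two radar anchors")
    A = np.stack([rel_disp, np.ones(n)], axis=1)

    best_inliers, best_count = None, 0
    for _ in range(n_iters):
        # Minimal sample: two anchors define one candidate line.
        i, j = rng.choice(n, size=2, replace=False)
        if np.isclose(rel_disp[i], rel_disp[j]):
            continue  # degenerate pair, slope undefined
        s = (target[i] - target[j]) / (rel_disp[i] - rel_disp[j])
        t = target[i] - s * rel_disp[i]
        inliers = np.abs(A @ np.array([s, t]) - target) < inlier_tol
        if inliers.sum() > best_count:
            best_count, best_inliers = int(inliers.sum()), inliers

    if best_inliers is None or best_count < 2:
        raise ValueError("RANSAC found no usable consensus set")
    # Refine with least squares on the consensus set.
    (s, t), *_ = np.linalg.lstsq(A[best_inliers], target[best_inliers], rcond=None)
    return float(s), float(t), best_inliers


class ScalarKalman:
    """Constant-state scalar Kalman filter for smoothing s or t over frames."""

    def __init__(self, x0, p0=1.0, q=1e-4, r=1e-2):
        self.x, self.p, self.q, self.r = x0, p0, q, r

    def update(self, z):
        self.p += self.q                   # predict: state assumed constant
        k = self.p / (self.p + self.r)     # Kalman gain
        self.x += k * (z - self.x)         # correct with the new per-frame fit
        self.p *= 1.0 - k
        return self.x


def to_metric_depth(rel_disp_map, s, t, min_disp=1e-3):
    """Convert a relative disparity map to metric depth using the fitted s, t."""
    disp = np.maximum(s * np.asarray(rel_disp_map, dtype=float) + t, min_disp)
    return 1.0 / disp
```

In a per-frame loop, one would fit `s` and `t` from the current radar anchors, pass each through its own `ScalarKalman` instance, and apply the smoothed values with `to_metric_depth`. Clamping the disparity at `min_disp` is an assumed safeguard against division by values near zero in far-range regions.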
Xie et al. (Fri,) studied this question.