This research focuses on enhancing 3D Multi-Object Tracking (3D-MOT) for autonomous driving. Pedestrians are particularly vulnerable in urban environments, and robust tracking methods are required to understand their movements. Prevalent Tracking-By-Detection (TBD) frameworks often underutilize the rich visual data from sensors such as cameras. This study leverages the visual foundation model DINOv2 to refine the TBD framework by incorporating the camera modality, thereby improving pedestrian tracking consistency and overall 3D-MOT performance. The proposed DINO-MOT framework is the first application of DINOv2 to enhancing 3D-MOT through pedestrian Re-Identification (Re-ID). It also implements a Score Filter Ceiling to prevent premature exclusion of low-confidence 3D detections during tracking association. Using DINOv2 as a feature extractor within the DINO-MOT framework reduces pedestrian ID switches by up to 12.3%. With an AMOTA of 76.3% on the nuScenes test dataset, DINO-MOT sets a new benchmark in the 3D-MOT literature, an improvement of 0.5% that secures the top rank on the leaderboard. This research also demonstrates the potential of applying a visual foundation model to improve the existing TBD framework and thereby enhance 3D-MOT in autonomous driving.
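To make the Re-ID idea concrete, the following is a minimal sketch of appearance-based association, not the DINO-MOT implementation. It assumes each track and each detection already carries an appearance embedding (for example, one produced by a feature extractor such as DINOv2) and pairs them by cosine similarity. The function names, the greedy matching strategy, and the `min_sim` threshold are all hypothetical choices for illustration; the abstract does not specify how DINO-MOT combines appearance cues with its 3D association.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero embedding as matching nothing
    return dot / (norm_a * norm_b)


def match_by_appearance(track_feats, det_feats, min_sim=0.5):
    """Greedily pair track IDs with detection indices.

    track_feats: dict mapping track ID -> embedding (hypothetical layout)
    det_feats:   list of embeddings, one per detection
    min_sim:     assumed similarity threshold, not taken from the paper
    """
    candidates = []
    for track_id, track_vec in track_feats.items():
        for det_idx, det_vec in enumerate(det_feats):
            sim = cosine_similarity(track_vec, det_vec)
            if sim >= min_sim:
                candidates.append((sim, track_id, det_idx))

    # Most similar pairs are assigned first; each track and each
    # detection is used at most once.
    candidates.sort(key=lambda c: c[0], reverse=True)
    used_tracks, used_dets, matches = set(), set(), {}
    for sim, track_id, det_idx in candidates:
        if track_id in used_tracks or det_idx in used_dets:
            continue
        matches[track_id] = det_idx
        used_tracks.add(track_id)
        used_dets.add(det_idx)
    return matches


# Toy 3-dimensional embeddings; real embeddings would be much larger.
tracks = {7: [1.0, 0.0, 0.2], 9: [0.0, 1.0, 0.1]}
detections = [[0.1, 0.9, 0.1], [0.9, 0.1, 0.2]]
matches = match_by_appearance(tracks, detections)
```

In this toy case track 7 pairs with detection 1 and track 9 with detection 0, because each pair points in nearly the same direction in feature space. A full tracker would typically combine such an appearance score with a geometric cost from the 3D boxes rather than rely on appearance alone.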
Lee et al. studied this question.