What type of study is this?

This is a quantitative study.

September 17, 2025 · Open Access

Text-Guided Spatio-Temporal 2D and 3D Data Fusion for Multi-Object Tracking with RegionCLIP

Key Points

The TG3MOT framework improves 3D multi-object tracking by incorporating semantic information from RegionCLIP, a vision-language model, into the YONTD architecture.
Compared with the YONTD baseline, association accuracy rises by 0.83% and ID switches fall by 16.7%, while the overall HOTA gain is a modest 0.64%.
The Target Semantic Matching Module (TSM) quantifies the uncertainty of semantic alignment and filters out unreliable regions.
Evaluation on the KITTI dataset demonstrates the framework's effectiveness in challenging real-world driving scenarios.

Abstract

3D Multi-Object Tracking (3D MOT) is a critical task in autonomous systems, where accurate and robust tracking of multiple objects in dynamic environments is essential. Traditional approaches primarily rely on visual or geometric features, often neglecting the rich semantic information available in textual modalities. In this paper, we propose Text-Guided 3D Multi-Object Tracking (TG3MOT), a novel framework that incorporates Vision-Language Models (VLMs) into the YONTD architecture to improve 3D MOT performance. Our framework leverages RegionCLIP, a multimodal open-vocabulary detector, to achieve fine-grained alignment between image regions and textual concepts, enabling the incorporation of semantic information into the tracking process. To address challenges such as occlusion, blurring, and ambiguous object appearances, we introduce the Target Semantic Matching Module (TSM), which quantifies the uncertainty of semantic alignment and filters out unreliable regions. Additionally, we propose the 3D Feature Exponential Moving Average Module (3D F-EMA) to incorporate temporal information, improving robustness in noisy or occluded scenarios. Furthermore, the Gaussian Confidence Fusion Module (GCF) is introduced to weight historical trajectory confidences based on temporal proximity, enhancing the accuracy of trajectory management. We evaluate our framework on the KITTI dataset and compare it with the YONTD baseline. Extensive experiments demonstrate that although the overall HOTA gain of TG3MOT is modest (+0.64%), our method achieves substantial improvements in association accuracy (+0.83%) and significantly reduces ID switches (−16.7%). These improvements are particularly valuable in real-world autonomous driving scenarios, where maintaining consistent trajectories under occlusion and ambiguous appearances is crucial for downstream tasks such as trajectory prediction and motion planning. The code will be made publicly available.
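
The abstract names three modules without giving their formulas, so the following is only a minimal sketch of the general techniques they allude to, not the authors' implementation. It assumes that 3D F-EMA behaves like a standard exponential moving average over per-track feature vectors, that GCF weights past confidences with a Gaussian kernel over frame age, and that TSM can be approximated by thresholding the normalized entropy of region-to-text similarity scores. All function names, the smoothing factor `alpha`, the bandwidth `sigma`, and the threshold `max_entropy` are hypothetical.

```python
import math
from typing import List, Sequence


def ema_update(prev: Sequence[float], new: Sequence[float],
               alpha: float = 0.9) -> List[float]:
    """Blend a track's stored feature with the newest observation.

    alpha is a hypothetical smoothing factor: higher values trust history more.
    """
    if len(prev) != len(new):
        raise ValueError("feature vectors must have the same length")
    return [alpha * p + (1.0 - alpha) * n for p, n in zip(prev, new)]


def gaussian_fused_confidence(history: Sequence[float],
                              sigma: float = 2.0) -> float:
    """Weighted mean of past confidences, ordered oldest to newest.

    A score that is `age` frames old gets weight exp(-age^2 / (2 * sigma^2)),
    so recent frames dominate. sigma is a hypothetical bandwidth in frames.
    """
    if not history:
        raise ValueError("history must not be empty")
    last = len(history) - 1
    weights = [math.exp(-((last - i) ** 2) / (2.0 * sigma ** 2))
               for i in range(len(history))]
    return sum(w * c for w, c in zip(weights, history)) / sum(weights)


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of similarity scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def is_reliable_region(similarities: Sequence[float],
                       max_entropy: float = 0.5) -> bool:
    """Keep a region only if its text alignment is confident enough.

    Assumption: uncertainty is the entropy of the softmax over the
    region-to-text similarities, normalized to [0, 1] by log(num_classes).
    """
    probs = softmax(similarities)
    if len(probs) < 2:
        return True  # a single candidate label carries no ambiguity
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs)) <= max_entropy
```

In this sketch, a region whose similarity scores are nearly equal across all text prompts has normalized entropy close to 1 and is rejected, while a region with one dominant score is kept. The Gaussian weighting gives the most recent frame a weight of 1 and lets older frames decay smoothly, which is one plausible reading of "temporal proximity" in the abstract.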
