Diffusion Mask-Driven Visual-language Tracking

Key Points

Key points are not available for this paper at this time.

Abstract

Most existing visual-language trackers greatly rely on the initial language descriptions on a target object to extract their multi-modal features. However, the initial language descriptions are often inaccurate in a highly time-varying video sequence and thus greatly deteriorate their tracking performance due to the low quality of extracted multi-modal features. To address this challenge, we propose a Diffusion Mask-Driven Visual-language Tracker (DMTrack) based on a diffusion model. Confronting the issue of low-quality multi-modal features due to inaccurate language descriptions, we leverage the diffusion model to capture high-quality semantic information from multi-modal features and transform it into target mask features. During the training phase, we further enhance the diffusion model's perception of pixel-level features by calculating the loss between the target mask features and the ground truth masks. Additionally, we perform joint localization of the target using both target mask features and visual features, instead of relying solely on multi-modal features for localization. Through extensive experiments on four tracking benchmarks (i.e., LaSOT, TNL2K, LaSOText, and OTB-Lang), we validate that our proposed Diffusion Mask-Driven Visual-language Tracker can improve the robustness and effectiveness of the model.

Bookmark

Diffusion Mask-Driven Visual-language Tracking

Key Points

Abstract

Cite This Study

Also Consider

Also Consider