What question did this study set out to answer?

This research aims to improve unsupervised visual object tracking by leveraging pretrained text-to-image diffusion models.

May 29, 2026

Leveraging Text-to-Image Diffusion Models for Unsupervised Visual Object Tracking

Key points

The study aims to improve unsupervised visual object tracking by leveraging pretrained text-to-image diffusion models.
It adapts text-to-image diffusion models to the tracking task through their cross-attention mechanism.
The method has two components: an initial prompt learner that identifies the target object in the first frame, and an online prompt updater that refines the prompt using motion information.
The method, named Diff-Tracking, was evaluated on six challenging tracking datasets.
Diff-Tracking achieved strong performance on the evaluated datasets compared to existing unsupervised trackers.
The reported results indicate improved tracking accuracy and consistency across video frames.

Abstract

Unsupervised visual object tracking is a challenging task that requires following arbitrary targets in videos without training on ground-truth annotations. Despite considerable progress, existing state-of-the-art unsupervised trackers often struggle in scenarios that demand a fine-grained understanding of the semantic and structural information within video frames. Text-to-image diffusion models are well known for generating images that accurately reflect the semantics and structures described in the input prompt, which demonstrates a strong grasp of both. Building on this capability, we approach unsupervised tracking from a new perspective by exploiting the rich semantic knowledge encoded in pretrained text-to-image diffusion models. To adapt these models, which were originally developed for image generation, to the tracking task, we reinterpret them as a bridge between the text and image modalities. This connection is realized through the cross-attention mechanism: when both text and an image are input into the model, its cross-attention maps highlight the regions of the image that are semantically aligned with the text. We therefore learn a prompt that represents the tracking target and activates the corresponding region in the cross-attention map for each frame, which enables object tracking with the diffusion model. Specifically, our method, Diff-Tracking, is composed of two main components: an initial prompt learner and an online prompt updater. The initial prompt learner generates a prompt that captures the target object in the first frame, allowing the diffusion model to identify the target. The online prompt updater refines the prompt based on motion information, enabling consistent tracking across video frames. We evaluate our approach on six challenging tracking datasets, showing that Diff-Tracking achieves strong performance compared to existing unsupervised trackers.
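
To make the cross-attention idea concrete, here is a minimal toy sketch in plain Python. It is not the Diff-Tracking implementation: the function names (`cross_attention_map`, `box_from_map`), the 2-dimensional features, the two-token prompt, and the fixed 0.5 threshold are all assumptions made for illustration. It only shows how a prompt token can activate a region of an attention map, and how that region can be read off as a box.

```python
import math


def cross_attention_map(features, tokens, target=0):
    """Per-position attention weight assigned to tokens[target].

    features: H x W grid of d-dimensional image feature vectors (queries).
    tokens:   list of d-dimensional prompt token embeddings (keys).
    Both are hypothetical stand-ins for real diffusion-model activations.
    """
    d = len(tokens[0])
    attn = []
    for row in features:
        out_row = []
        for q in row:
            # Scaled dot-product score of this position against every token.
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
            # Softmax over tokens (subtract the max for numerical stability).
            m = max(scores)
            exps = [math.exp(s - m) for s in scores]
            out_row.append(exps[target] / sum(exps))
        attn.append(out_row)
    return attn


def box_from_map(attn, threshold=0.5):
    """Tightest (x_min, y_min, x_max, y_max) box around positions above threshold."""
    hits = [(x, y) for y, row in enumerate(attn) for x, w in enumerate(row) if w > threshold]
    if not hits:
        return None
    xs = [x for x, _ in hits]
    ys = [y for _, y in hits]
    return (min(xs), min(ys), max(xs), max(ys))


# Toy 4x4 frame: a 2x2 "object" patch whose features align with the target token.
target_token = [4.0, 0.0]
background_token = [0.0, 4.0]
features = [[[0.0, 1.0] for _ in range(4)] for _ in range(4)]
for y in (1, 2):
    for x in (2, 3):
        features[y][x] = [1.0, 0.0]

attn = cross_attention_map(features, [target_token, background_token])
box = box_from_map(attn)
print(box)  # (2, 1, 3, 2)
```

In this toy setup the prompt token is fixed by hand. In the method described above, the prompt is instead learned from the first frame and then refined online, details that this sketch does not attempt to reproduce.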
