Most vision-language (VL) trackers rely on coarse-grained, sentence-level information to achieve multi-modal alignment. However, because sentences are inherently ambiguous, summarizing, and invariant, this information is insufficient for accurately describing the target in each frame, which makes multi-modal alignment challenging. This paper introduces TTCTrack, a novel VL tracker that employs textual token classification to address this challenge. Specifically, we exploit multi-modal cross-relations to classify textual tokens into various types and apply different operations to them, enabling multi-modal alignment throughout the tracking process. As a result, the textual tokens accurately describe the target and dynamically reflect scene changes. Moreover, we introduce a dual-encoder structure that effectively handles multi-modal input and fusion. Extensive experiments on four datasets demonstrate the effectiveness of the proposed tracking method.
Zhongjie Mao
Wuhan University
Yucheng Wang
University of Science and Technology of China
Xi Chen
Wuhan University
Wuhan University
Mao et al. studied this question.
synapsesocial.com/papers/68e7397eb6db6435876b29d6 — DOI: https://doi.org/10.1109/icassp48485.2024.10446122