Most vision-language (VL) trackers rely on coarse-grained, sentence-level information to achieve multi-modal alignment. However, this information is insufficient for accurately describing the target in each frame because sentences are inherently ambiguous, summarizing, and invariant over time, which makes multi-modal alignment challenging. This paper introduces TTCTrack, a novel VL tracker that employs textual token classification to address this challenge. Specifically, we exploit multi-modal cross-relations to classify textual tokens into various types and apply different operations to each type, enabling multi-modal alignment throughout the tracking process. As a result, the textual tokens can accurately describe the target and dynamically reflect scene changes. Moreover, we introduce a dual-encoder structure that effectively handles multi-modal input and fusion. Extensive experiments on four datasets demonstrate the effectiveness of the proposed tracking method.
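To make the idea of classifying textual tokens by their cross-relations with visual features more concrete, the following is a minimal sketch and not TTCTrack's actual implementation. The abstract does not name the token types, the scoring function, or the per-type operations, so everything here is an assumption for illustration: cosine similarity as the cross-relation, three hypothetical labels (aligned, ambiguous, unrelated), the two thresholds, and the keep / blend / zero-out operations.

```python
import numpy as np


def classify_text_tokens(text_tokens, visual_tokens, high=0.5, low=0.1):
    """Label each textual token by its strongest cosine similarity to any visual token.

    Hypothetical labels: 2 = aligned, 1 = ambiguous, 0 = unrelated.
    Returns the labels and the (num_text, num_visual) cross-relation matrix.
    """
    text_tokens = np.asarray(text_tokens, dtype=float)
    visual_tokens = np.asarray(visual_tokens, dtype=float)
    eps = 1e-8  # avoids division by zero for all-zero embeddings
    t = text_tokens / (np.linalg.norm(text_tokens, axis=1, keepdims=True) + eps)
    v = visual_tokens / (np.linalg.norm(visual_tokens, axis=1, keepdims=True) + eps)
    cross = t @ v.T
    score = cross.max(axis=1)
    labels = np.where(score >= high, 2, np.where(score >= low, 1, 0))
    return labels, cross


def apply_token_operations(text_tokens, visual_tokens, labels, cross):
    """Apply a different (assumed) operation to each token type.

    aligned   -> kept unchanged
    ambiguous -> averaged with an attention-weighted visual context vector
    unrelated -> zeroed out
    """
    text_tokens = np.asarray(text_tokens, dtype=float)
    visual_tokens = np.asarray(visual_tokens, dtype=float)
    # Row-wise softmax over visual tokens, shifted for numerical stability.
    e = np.exp(cross - cross.max(axis=1, keepdims=True))
    weights = e / e.sum(axis=1, keepdims=True)
    context = weights @ visual_tokens

    out = text_tokens.copy()
    ambiguous = labels == 1
    out[ambiguous] = 0.5 * text_tokens[ambiguous] + 0.5 * context[ambiguous]
    out[labels == 0] = 0.0
    return out
```

In a per-frame tracking loop, the two functions would be called once per frame with that frame's visual features, so the refined textual tokens change as the scene changes. This sketch assumes text and visual embeddings share the same dimensionality.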
Mao et al. studied this question.