March 18, 2024 | Open Access

Textual Tokens Classification for Multi-Modal Alignment in Vision-Language Tracking

Abstract

Most vision-language (VL) trackers rely on coarse-grained information from sentences to achieve multi-modal alignment. However, this information is insufficient for accurately describing the target in each frame due to the inherent ambiguity, summarization, and invariance of sentences, thereby making multi-modal alignment challenging. This paper introduces TTCTrack, a novel VL tracker that employs textual token classification to address this challenge. Specifically, we exploit the multi-modal cross-relations to classify textual tokens into various types and employ diverse operations on them, enabling multi-modal alignment in the tracking process. This enables the textual tokens to accurately describe the target and dynamically reflect the scene changes. Moreover, we introduce a dual-encoder structure that effectively handles the multi-modal input and fusion. Extensive experiments on four datasets demonstrate the effectiveness of our proposed tracking method.
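The abstract does not specify the token types, the cross-relation measure, or the per-type operations, so the following is only a minimal sketch of the general idea under stated assumptions, not the TTCTrack implementation. It assumes cosine similarity between each textual token embedding and a single visual target feature as the cross-relation; the three labels (`aligned`, `uncertain`, `misaligned`), the thresholds `hi` and `lo`, and the keep/blend/mask operations are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify_tokens(token_embs, visual_feat, hi=0.8, lo=0.3):
    """Assign each textual token a hypothetical type from its similarity to the visual feature.

    The thresholds and label names are assumptions for this sketch only.
    """
    labels = []
    for emb in token_embs:
        score = cosine(emb, visual_feat)
        if score >= hi:
            labels.append("aligned")
        elif score >= lo:
            labels.append("uncertain")
        else:
            labels.append("misaligned")
    return labels


def apply_operations(token_embs, labels, visual_feat, mix=0.5):
    """Apply a different (hypothetical) operation to each token type.

    aligned    -> keep the token embedding unchanged
    uncertain  -> blend the token with the current visual feature
    misaligned -> mask the token with zeros
    """
    updated = []
    for emb, label in zip(token_embs, labels):
        if label == "aligned":
            updated.append(list(emb))
        elif label == "uncertain":
            updated.append([(1 - mix) * e + mix * v for e, v in zip(emb, visual_feat)])
        else:
            updated.append([0.0] * len(emb))
    return updated


# Toy usage with 2-D embeddings; real trackers would use learned features per frame.
visual = [1.0, 0.0]
tokens = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
labels = classify_tokens(tokens, visual)
refined = apply_operations(tokens, labels, visual)
```

Re-running the classification on every frame with an updated visual feature is one way the textual tokens could follow scene changes, which is the behavior the abstract describes; the paper itself should be consulted for the actual token types and operations.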

Authors

Zhongjie Mao

Wuhan University

Yucheng Wang

University of Science and Technology of China

Xi Chen

Wuhan University

Institutions

Wuhan University
