June 4, 2024 | Open Access

Divert More Attention to Vision-Language Object Tracking

Abstract

Multimodal vision-language (VL) learning has noticeably advanced the trend toward general intelligence, owing to emerging large foundation models. However, tracking, a fundamental vision problem, has benefited surprisingly little from the recent flourishing of VL learning. We argue that the reasons are two-fold: the lack of large-scale vision-language annotated videos, and the ineffective vision-language interaction learning in current works. These shortcomings motivate us to design a more effective vision-language representation for tracking, while also constructing a large database with language annotation for model learning. Specifically, in this paper, we first propose a general attribute annotation strategy to annotate videos in six popular tracking benchmarks, which yields a large-scale vision-language tracking database with more than 23,000 videos. We then introduce a novel framework that improves tracking by learning a unified-adaptive VL representation, whose core components are the proposed asymmetric architecture search and modality mixer (ModaMixer). To further improve the VL representation, we introduce a contrastive loss to align the different modalities. To thoroughly demonstrate the effectiveness of our method, we integrate the proposed framework into three tracking methods with different designs, i.e., the CNN-based SiamCAR [1], the Transformer-based OSTrack [2], and the hybrid-structure TransT [3]. The experiments demonstrate that our framework significantly improves all baselines on six benchmarks. Besides empirical results, we theoretically analyze our approach to show its rationality. By revealing the potential of VL representation, we expect the community to divert more attention to VL tracking, and we hope to open more possibilities for future tracking with diverse multimodal information.
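
The abstract states that a contrastive loss is used to align the vision and language modalities but does not give its form. As a rough illustration only, the following is a minimal sketch of a generic symmetric InfoNCE-style loss over paired vision and language embeddings. It is an assumption about what such an alignment loss may look like, not the loss defined in the paper; the function names, the temperature value of 0.07, and the toy embeddings are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_alignment_loss(vision, language, temperature=0.07):
    """Generic symmetric InfoNCE loss (hypothetical stand-in, not the paper's).

    vision[i] and language[i] are assumed to describe the same video, so
    matched pairs sit on the diagonal of the similarity matrix.
    """
    v = [l2_normalize(x) for x in vision]
    t = [l2_normalize(x) for x in language]
    # Cosine similarity between every vision/language pair, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
        for vi in v
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    # Average the vision-to-language and language-to-vision directions.
    return 0.5 * (
        diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_transposed)
    )


if __name__ == "__main__":
    # Toy embeddings (made up): each language vector matches its vision vector.
    aligned = contrastive_alignment_loss([[1.0, 0.0], [0.0, 1.0]],
                                         [[1.0, 0.0], [0.0, 1.0]])
    # Same vectors with the language descriptions swapped between videos.
    swapped = contrastive_alignment_loss([[1.0, 0.0], [0.0, 1.0]],
                                         [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < swapped)
```

In this sketch the loss is small when each vision embedding is most similar to its own language embedding and large when the pairing is swapped, which is the behaviour one would expect from a loss meant to align the two modalities.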

Authors

Mingzhe Guo

Beijing Jiaotong University

Zhipeng Zhang

Universidad del Noreste

Liping Jing

Beijing Jiaotong University

Journals

IEEE Transactions on Pattern Analysis and Machine Intelligence

Institutions

Stony Brook University

University of North Texas

Beijing Jiaotong University
