Multimodal vision-language (VL) learning has markedly advanced progress toward general intelligence, owing to emerging large foundation models. However, tracking, a fundamental vision problem, has benefited surprisingly little from the recent surge in VL learning. We argue that the reasons are two-fold: the lack of large-scale vision-language annotated videos and the ineffective vision-language interaction learning of current works. These limitations motivate us to design a more effective vision-language representation for tracking, while also constructing a large database with language annotation for model learning. In particular, we first propose a general attribute annotation strategy to annotate the videos in six popular tracking benchmarks, which yields a large-scale vision-language tracking database with more than 23,000 videos. We then introduce a novel framework that improves tracking by learning a unified-adaptive VL representation, whose core components are the proposed asymmetric architecture search and modality mixer (ModaMixer). To further improve the VL representation, we introduce a contrastive loss to align the different modalities. To thoroughly demonstrate the effectiveness of our method, we integrate the proposed framework into three tracking methods with different designs, i.e., the CNN-based SiamCAR [1], the Transformer-based OSTrack [2], and the hybrid-structure TransT [3]. The experiments demonstrate that our framework significantly improves all baselines on six benchmarks. Besides empirical results, we theoretically analyze our approach to show its rationality. By revealing the potential of VL representation, we expect the community to devote more attention to VL tracking and hope to open more possibilities for future tracking with diversified multimodal messages.
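As an illustration of what aligning two modalities with a contrastive loss can look like, the following is a minimal sketch of a generic symmetric InfoNCE-style loss over paired vision and language embeddings. The abstract does not specify the paper's actual loss, so this is not the authors' formulation: the function name `contrastive_alignment_loss`, the `temperature` value, and the use of cosine similarity are all assumptions made for illustration.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_alignment_loss(vision, language, temperature=0.07):
    """Hypothetical symmetric contrastive loss (generic InfoNCE sketch).

    vision[i] and language[i] are assumed to describe the same video;
    every other pairing in the batch is treated as a negative.
    """
    v = [_normalize(x) for x in vision]
    t = [_normalize(x) for x in language]
    # Cosine similarity between every vision/language pair, scaled.
    logits = [
        [sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
        for vi in v
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the vision-to-language and language-to-vision directions.
    return 0.5 * (
        _diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t)
    )
```

With matched pairs the loss is close to zero, and it grows when the pairing between the two modalities is shuffled, which is the behavior an alignment objective of this kind is meant to encourage.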
Mingzhe Guo
Beijing Jiaotong University
Zhipeng Zhang
Universidad del Noreste
Liping Jing
Beijing Jiaotong University
IEEE Transactions on Pattern Analysis and Machine Intelligence
Stony Brook University
University of North Texas
Beijing Jiaotong University
Guo et al. studied this question.
synapsesocial.com/papers/68e6634ab6db6435875efc6d — DOI: https://doi.org/10.1109/tpami.2024.3409078