Notably, ResNet, a well-established convolutional neural network (CNN) backbone, has demonstrated strong generality and practicality in visual tracking. Compared with Vision Transformers (ViTs), CNNs offer favorable inductive biases and lower computational overhead, whereas ViTs excel at capturing long-range contextual dependencies and produce more expressive features. To combine the complementary strengths of the two architectures, we propose a suite of hybrid fusion modules built on parallel CNN and ViT backbones. Specifically, we systematically investigate diverse architectural combinations and adaptive fusion strategies to obtain rich multi-grained feature representations while keeping training cost low. Furthermore, we replace computationally expensive Transformer-based interaction modules with simplified self-attention operations, achieving efficient feature interaction without sacrificing tracking performance. Finally, to accommodate changes in target appearance during continuous tracking, we design a low-cost temporal update mechanism driven by real-time prediction results, which improves the model’s ability to adapt to target drift and environmental changes. Extensive experiments on eight mainstream tracking benchmarks demonstrate that the proposed tracker outperforms state-of-the-art methods by a significant margin. The source code, raw tracking results, and pre-trained models are publicly available at https://github.com/hexdjx/VisTrack.
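As an illustration of what a prediction-driven temporal update could look like, the following minimal sketch gates a template refresh on the tracker's confidence score. It is not taken from the released code: the class name `TemplateMemory`, the confidence threshold, and the moving-average rule with its momentum value are all assumptions made for this example, and features are represented as plain lists of floats for brevity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TemplateMemory:
    """Hypothetical low-cost template update gated by prediction confidence."""

    initial: List[float]            # template feature from the first frame (never modified)
    conf_threshold: float = 0.7     # assumed value: minimum confidence to accept an update
    momentum: float = 0.9           # assumed value: weight kept on the old dynamic template
    dynamic: List[float] = field(init=False)

    def __post_init__(self) -> None:
        # The dynamic template starts as a copy of the initial one.
        self.dynamic = list(self.initial)

    def update(self, feature: List[float], confidence: float) -> bool:
        """Blend the new feature into the dynamic template if the prediction is reliable.

        Returns True when the template was updated, False when the update was skipped.
        """
        if confidence < self.conf_threshold:
            # Low-confidence predictions are ignored to limit drift.
            return False
        m = self.momentum
        self.dynamic = [m * d + (1.0 - m) * f for d, f in zip(self.dynamic, feature)]
        return True


memory = TemplateMemory(initial=[1.0, 0.0])
skipped = memory.update([0.0, 1.0], confidence=0.3)   # below threshold, template unchanged
accepted = memory.update([0.0, 1.0], confidence=0.9)  # above threshold, template blended
```

Keeping the initial template untouched while only the dynamic copy is blended is one common way to limit error accumulation; whether the proposed tracker does the same is not stated in the text above.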
Li et al. studied this question.