What question did this study set out to answer?

This research aims to enhance visual tracking performance by combining CNN and ViT architectures through hybrid attention mechanisms.

May 16, 2026 | Open Access

Rethinking two-stream discriminative tracking via hybrid attention and efficient transformer

Key Points

Developed hybrid fusion modules with parallel CNN and ViT backbones.
Implemented simplified self-attention for efficient feature interaction.
Designed a low-cost temporal update mechanism for adaptive tracking.
The hybrid tracking model significantly outperforms state-of-the-art methods across eight benchmarks.
Achieved robust feature representation while maintaining low training costs.
Enhanced adaptability to dynamic target appearance variations during continuous tracking.
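The temporal update idea in the points above can be sketched as a confidence-gated running average of the template features. This is only an illustration under stated assumptions: the summary does not say how the update is computed, so the function name `update_template`, the `threshold` and `momentum` parameters, and the moving-average rule are all hypothetical stand-ins rather than the authors' method.

```python
def update_template(template, candidate, confidence, threshold=0.6, momentum=0.9):
    """Blend a new target feature into the stored template.

    Assumption: the update is an exponential moving average that is applied
    only when the tracker's own prediction confidence is high enough.
    `threshold` and `momentum` are hypothetical values, not from the paper.
    """
    if confidence < threshold:
        # Low-confidence prediction: keep the old template to limit drift.
        return list(template)
    # High-confidence prediction: move the template slightly toward the new feature.
    return [momentum * t + (1.0 - momentum) * c for t, c in zip(template, candidate)]


if __name__ == "__main__":
    template = [1.0, 1.0]
    new_feature = [0.0, 0.0]
    print(update_template(template, new_feature, confidence=0.3))
    print(update_template(template, new_feature, confidence=0.9))
```

Gating on the prediction score keeps the cost low, since no extra network is evaluated, and skipping low-confidence frames is one common way to avoid contaminating the template during occlusion.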

Abstract

ResNet, a well-established Convolutional Neural Network (CNN) backbone, has demonstrated strong universality and practicality in visual tracking tasks. Compared with Vision Transformers (ViTs), CNNs inherently possess favorable inductive biases and lower computational overhead, while ViTs excel at capturing long-range contextual dependencies and exhibit superior feature expressiveness. To combine the complementary strengths of these two architectures, we propose a suite of hybrid fusion modules with parallel CNN and ViT backbones. Specifically, we systematically investigate diverse architectural combinations and adaptive fusion strategies to obtain rich multi-grained feature representations while preserving low training cost. Furthermore, we replace computationally expensive Transformer-based interaction modules with simplified self-attention operations, achieving efficient feature interaction without sacrificing tracking performance. Finally, to better accommodate dynamic target appearance variations during continuous tracking, we design a low-cost temporal update mechanism driven by real-time prediction results, which improves the model’s ability to adapt to target drift and environmental changes. Extensive experiments on eight mainstream tracking benchmarks demonstrate that the proposed tracker outperforms state-of-the-art methods by a significant margin. The source code, raw tracking results, and pre-trained models are publicly available at https://github.com/hexdjx/VisTrack.
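To make the fusion and interaction steps described in the abstract more concrete, here is a minimal sketch in plain Python. It assumes the CNN and ViT branches already produce token lists of the same shape, that fusion is a fixed weighted sum, and that "simplified self-attention" means single-head attention without learned query, key, or value projections. The abstract does not specify any of these details, so the names `fuse_features`, `simple_self_attention`, and `alpha` are hypothetical.

```python
import math


def fuse_features(cnn_tokens, vit_tokens, alpha=0.5):
    """Blend two equally shaped token lists.

    Assumption: a fixed weighted sum stands in for the adaptive fusion
    strategy; `alpha` is a hypothetical fusion weight.
    """
    return [
        [alpha * c + (1.0 - alpha) * v for c, v in zip(ct, vt)]
        for ct, vt in zip(cnn_tokens, vit_tokens)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def simple_self_attention(tokens):
    """Single-head self-attention with no learned projections (assumed form).

    Each output token is a softmax-weighted average of all input tokens,
    with weights given by scaled dot-product similarity.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    output = []
    for query in tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        output.append(
            [sum(w * tok[j] for w, tok in zip(weights, tokens)) for j in range(dim)]
        )
    return output


if __name__ == "__main__":
    cnn = [[2.0, 0.0], [0.0, 2.0]]
    vit = [[0.0, 0.0], [0.0, 0.0]]
    fused = fuse_features(cnn, vit)      # template and search tokens would be concatenated here
    print(simple_self_attention(fused))
```

Dropping the projection matrices removes most of the parameters of a standard attention layer, which is one plausible reading of how interaction cost could be reduced; the released code at the URL above is the authoritative reference for the actual design.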
