RGB-Thermal (RGB-T) tracking improves the robustness of visual tracking by combining RGB and thermal infrared (TIR) modalities, addressing the limitations of RGB-only trackers under challenging conditions such as low light and appearance variations. However, most existing RGB-T trackers rely on complex fusion modules or modality-specific architectures, trading efficiency for accuracy. In this paper, we propose a novel Multi-level Self-Distillation (MSD) framework that adapts a one-stream RGB tracker to the RGB-T setting without modifying the network architecture or adding any extra parameters. RGB and TIR inputs are jointly processed by a shared backbone, and training is guided by a combination of self-supervised and supervised objectives that enhance cross-modal feature representation. The self-supervised component comprises a contrastive loss that aligns semantically consistent regions across template-search pairs and a modality-gap alignment loss that reduces discrepancies between RGB and TIR features. These internal signals complement task-driven supervision, which consists of an intermediate focal loss that strengthens early localization by enhancing shallow and mid-level features, modality-specific losses that preserve distinctive cues under partial modality degradation, and a fused tracking loss that drives the final bounding box prediction. Comprehensive evaluations on the LasHeR, RGBT234, and GTOT benchmarks demonstrate that MSD achieves state-of-the-art tracking accuracy while maintaining the computational efficiency of the original RGB tracker. Our work establishes a new paradigm in multi-modal tracking by demonstrating that optimized training strategies can outperform complex architectural modifications, offering significant practical advantages for real-world deployment.
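To make the structure of the training objective concrete, the following is a minimal sketch of how the self-supervised and supervised terms named above might be combined. It is not the authors' implementation: the abstract does not give the loss formulas, so the InfoNCE form of the contrastive loss, the mean-squared form of the modality-gap loss, the uniform weights, the function names, and the placeholder values for the supervised terms are all assumptions made for illustration.

```python
import math

def modality_gap_loss(rgb_feats, tir_feats):
    """Mean squared difference between paired RGB and TIR feature vectors.
    Assumed form: the abstract only says the loss reduces RGB/TIR discrepancy."""
    total, count = 0.0, 0
    for r, t in zip(rgb_feats, tir_feats):
        for a, b in zip(r, t):
            total += (a - b) ** 2
            count += 1
    return total / count

def info_nce(template_feats, search_feats, temperature=0.1):
    """InfoNCE-style contrastive loss (assumed form): the i-th template
    vector should be most similar to the i-th search vector."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v + 1e-12)

    loss = 0.0
    for i, query in enumerate(template_feats):
        logits = [cosine(query, key) / temperature for key in search_feats]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(template_feats)

def total_loss(terms, weights):
    """Weighted sum of named loss terms (the weights are hypothetical)."""
    return sum(weights[name] * value for name, value in terms.items())

# Toy feature vectors standing in for backbone outputs.
template = [[1.0, 0.0], [0.0, 1.0]]
search = [[0.9, 0.1], [0.2, 0.8]]
rgb = [[1.0, 0.0], [0.0, 1.0]]
tir = [[0.8, 0.2], [0.1, 0.9]]

terms = {
    "contrastive": info_nce(template, search),
    "modality_gap": modality_gap_loss(rgb, tir),
    # Placeholder scalars for the supervised terms named in the abstract.
    "intermediate_focal": 0.30,
    "rgb_specific": 0.50,
    "tir_specific": 0.60,
    "fused_tracking": 0.40,
}
weights = {name: 1.0 for name in terms}
loss = total_loss(terms, weights)
```

Because every term is computed from the outputs of the same shared backbone, a combination of this kind adds training signals without adding parameters, which is consistent with the efficiency claim in the abstract.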
Awad et al. studied this question.