What question did this study set out to answer?

This research aims to enhance RGB-T tracking efficiency and robustness by proposing a novel self-distillation framework.

March 2, 2026

A Multi-level Self-Distillation-Based Unified Tracker for Efficient RGB-T Tracking

Key Points

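Developed a Multi-level Self-Distillation (MSD) framework that adapts a one-stream RGB tracker to RGB-T input without modifying the network architecture or adding parameters.
Jointly processed RGB and thermal infrared (TIR) inputs through a shared backbone.
Combined self-supervised and supervised training objectives to enhance cross-modal feature representation.
Used a contrastive loss to align semantically consistent regions across template-search pairs, and a modality-gap alignment loss to reduce discrepancies between RGB and TIR features.
Achieved state-of-the-art tracking accuracy on the LasHeR, RGBT234, and GTOT benchmarks.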
Maintained computational efficiency of the original RGB tracker.
Showed that optimized training strategies can outperform complex architectural modifications, a practical advantage for real-world deployment.
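
The two self-supervised signals named in the key points can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes an InfoNCE-style contrastive loss over cosine similarities and a mean-squared-distance alignment term, and the function names (`info_nce`, `modality_gap`), the temperature value, and the toy feature vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def info_nce(anchors, positives, temperature=0.1):
    """Assumed InfoNCE form: anchors[i] should match positives[i], not positives[j != i]."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [cosine(a, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(anchors)

def modality_gap(rgb_feats, tir_feats):
    """Assumed alignment term: mean squared distance between paired RGB and TIR features."""
    total = 0.0
    for r, t in zip(rgb_feats, tir_feats):
        total += sum((a - b) ** 2 for a, b in zip(r, t)) / len(r)
    return total / len(rgb_feats)

# Toy region features: three template regions and their matching search regions.
template = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
search_aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
search_shuffled = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]

loss_aligned = info_nce(template, search_aligned)
loss_shuffled = info_nce(template, search_shuffled)
```

In this toy setup the contrastive loss is small when each template region is most similar to its own search region and large when the pairing is scrambled, while `modality_gap` shrinks to zero as the paired RGB and TIR features coincide.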

Abstract

RGB-Thermal (RGB-T) tracking enhances visual tracking robustness by combining RGB and thermal infrared (TIR) modalities, addressing limitations of RGB-only trackers under challenging conditions such as low light and appearance variations. However, most existing RGB-T trackers rely on complex fusion modules or modality-specific architectures, sacrificing efficiency for performance. In this paper, we propose a novel Multi-level Self-Distillation (MSD) framework that adapts a one-stream RGB tracker to the RGB-T setting without modifying the network architecture or adding any extra parameters. RGB and TIR inputs are jointly processed through a shared backbone, and training is guided by a combination of self-supervised and supervised objectives to enhance cross-modal feature representation. The self-supervised component includes a contrastive loss that aligns semantically consistent regions across template-search pairs, as well as a modality-gap alignment loss that reduces discrepancies between RGB and TIR features. These internal signals complement task-driven supervision, including an intermediate focal loss that strengthens early localization by enhancing shallow and mid-level features, modality-specific losses that preserve distinctive cues under partial modality degradation, and a fused tracking loss that drives final bounding box prediction. Comprehensive evaluations on LasHeR, RGBT234, and GTOT benchmarks demonstrate that MSD achieves state-of-the-art tracking accuracy while maintaining the computational efficiency of the original RGB tracker. Our work establishes a new paradigm in multi-modal tracking by demonstrating that optimized training strategies can outperform complex architectural modifications, offering significant practical advantages for real-world deployment.
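
As a rough illustration of how the supervised and self-supervised terms described above might be combined during training, the sketch below uses the standard binary focal loss and a plain weighted sum. The abstract does not state the exact focal-loss variant or any loss weights, so `focal_loss`, `total_objective`, the term names, and every numeric value here are assumptions for illustration only.

```python
import math

def focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-7):
    """Standard binary focal loss, averaged over elements.

    probs: predicted foreground probabilities; targets: 0/1 labels.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        p_t = p if y == 1 else 1.0 - p
        alpha_t = alpha if y == 1 else 1.0 - alpha
        # (1 - p_t)^gamma down-weights examples that are already well classified
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)

def total_objective(terms, weights):
    """Weighted sum of named loss terms; raises KeyError if a weight is missing."""
    return sum(weights[name] * value for name, value in terms.items())

# Hypothetical per-term loss values and weights (not taken from the paper).
terms = {
    "fused_tracking": 1.20,
    "intermediate_focal": focal_loss([0.9, 0.2, 0.1], [1, 0, 0]),
    "rgb_specific": 0.80,
    "tir_specific": 0.90,
    "contrastive": 0.50,
    "modality_gap": 0.30,
}
weights = {name: 1.0 for name in terms}
loss = total_objective(terms, weights)
```

Because every term only adds a training signal, a scheme of this shape leaves the inference-time network untouched, which is consistent with the abstract's claim that the tracker keeps the efficiency of the original RGB model.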

Cite This Study

Awad et al. studied this question.

synapsesocial.com/papers/69a528ecf1e85e5c73bf05ed https://doi.org/10.1109/tip.2026.3666737