RGBT tracking is highly valuable in unmanned aerial vehicle (UAV) ground observation missions, supporting scenarios such as nighttime monitoring and low-altitude reconnaissance. However, existing frameworks based on CNNs or Transformers face an inherent trade-off between cross-modal interaction capability and computational efficiency. Furthermore, current methods perform poorly in challenging UAV-view scenarios involving target scale variation and rapid motion. To address these issues, this paper proposes a novel multimodal interaction and fusion Mamba network (MIFMNet), which differs fundamentally from both existing RGBT fusion trackers and recent Mamba-based tracking methods. Unlike existing RGBT trackers, which rely on the local convolutions of CNNs or the quadratic-complexity self-attention of Transformers for cross-modal fusion, MIFMNet builds modality-adaptive interaction mechanisms on Mamba, fully leveraging complementary information across modalities while resolving the efficiency-accuracy trade-off. Specifically, we design the scale differential enhanced Mamba (SDEM), which expands the receptive field through multiscale parallel convolutions and amplifies complementary information via differential strategies, strengthening feature responses to scale-varying objects. We further propose the flow-guided multilayer interaction Mamba (FMIM), which integrates inter-frame motion information into scanning prediction. This enables the network to adaptively adjust interaction priorities between shallow texture features and high-level semantic features according to motion intensity, mitigating the forgetting of early information and enhancing robustness in dynamic scenes. Extensive experiments on four major benchmarks demonstrate that MIFMNet achieves state-of-the-art precision and success rate, performing particularly well in UAV scenarios involving occlusion, scale variation, and rapid motion.
MIFMNet also runs at an inference speed of 35.3 FPS, enabling efficient deployment on resource-constrained platforms and providing robust support for UAV applications of RGBT tracking.
Guo et al. (Sun,) studied this question.