Visual object tracking is a fundamental computer vision task recently extended to multimodal settings, where natural language descriptions complement visual information. Existing multimodal trackers typically rely on large-scale transformer architectures that jointly train visual and textual encoders, resulting in hundreds of millions of trainable parameters and substantial computational overhead. We propose a lightweight multimodal adapter that integrates textual descriptions into a state-of-the-art visual-only framework with minimal overhead. The pretrained visual and text encoders are frozen, and only a small projection network is trained to align text embeddings with visual features. The adapter is modular, can be toggled at inference, and has negligible impact on speed. Extensive experiments demonstrate that textual cues improve tracking robustness and enable efficient multimodal integration with over 100× fewer trainable parameters than heavy multimodal trackers, allowing training and deployment on resource-limited devices.
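The adapter idea described above (frozen encoders, a small trained projection, and an inference-time toggle) can be illustrated with a minimal, dependency-free sketch. Everything here is an assumption made for illustration: the class name `TextAdapter`, the two-layer projection shape, the toy dimensions, and fusion by element-wise addition are not taken from the paper, which does not specify these details in the abstract.

```python
import random


class TextAdapter:
    """Hypothetical projection adapter; names, sizes, and fusion rule are illustrative."""

    def __init__(self, text_dim, visual_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        # The only trainable weights: a two-layer projection (an assumed shape).
        # The visual and text encoders would stay frozen and are not modeled here.
        self.w1 = [[rng.gauss(0.0, 0.02) for _ in range(text_dim)]
                   for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        self.w2 = [[rng.gauss(0.0, 0.02) for _ in range(hidden_dim)]
                   for _ in range(visual_dim)]
        self.b2 = [0.0] * visual_dim
        self.enabled = True  # the adapter can be switched off at inference

    def num_trainable_params(self):
        """Count the weights and biases of the projection network."""
        return (len(self.w1) * len(self.w1[0]) + len(self.b1)
                + len(self.w2) * len(self.w2[0]) + len(self.b2))

    def project(self, text_emb):
        """Map a frozen text embedding into the visual feature space."""
        hidden = [max(0.0, sum(w * x for w, x in zip(row, text_emb)) + b)
                  for row, b in zip(self.w1, self.b1)]  # linear layer + ReLU
        return [sum(w * h for w, h in zip(row, hidden)) + b
                for row, b in zip(self.w2, self.b2)]

    def fuse(self, visual_feat, text_emb=None):
        """Add the projected text cue to the visual feature (assumed fusion rule)."""
        if not self.enabled or text_emb is None:
            return list(visual_feat)  # visual-only path is left unchanged
        projected = self.project(text_emb)
        return [v + p for v, p in zip(visual_feat, projected)]


# Toy stand-ins for the outputs of the frozen encoders.
adapter = TextAdapter(text_dim=8, visual_dim=4, hidden_dim=6)
visual_feat = [0.5, -0.1, 0.3, 0.0]
text_emb = [0.1] * 8

fused = adapter.fuse(visual_feat, text_emb)        # multimodal path
adapter.enabled = False
visual_only = adapter.fuse(visual_feat, text_emb)  # adapter toggled off
```

Because only the projection holds trainable weights, the trainable parameter count depends on the embedding and hidden sizes rather than on the size of the encoders, and switching the adapter off returns the visual feature untouched, so the visual-only tracker's behavior is preserved.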
Borsuk et al. studied this question.