November 15, 2025 | Open Access

Lightweight Multimodal Adapter for Visual Object Tracking

Abstract

Visual object tracking is a fundamental computer vision task recently extended to multimodal settings, where natural language descriptions complement visual information. Existing multimodal trackers typically rely on large-scale transformer architectures that jointly train visual and textual encoders, resulting in hundreds of millions of trainable parameters and substantial computational overhead. We propose a lightweight multimodal adapter that integrates textual descriptions into a state-of-the-art visual-only framework with minimal overhead. The pretrained visual and text encoders are frozen, and only a small projection network is trained to align text embeddings with visual features. The adapter is modular, can be toggled at inference, and has negligible impact on speed. Extensive experiments demonstrate that textual cues improve tracking robustness and enable efficient multimodal integration with over 100× fewer trainable parameters than heavy multimodal trackers, allowing training and deployment on resource-limited devices.
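
To make the adapter idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the class name `TextAdapter`, the single linear projection, the additive fusion, and all dimensions and feature values are assumptions chosen for illustration. It only shows the three properties the abstract names: the encoder outputs are treated as fixed inputs, the projection holds the only trainable parameters, and the adapter can be switched off at inference.

```python
import random


class TextAdapter:
    """Hypothetical adapter: one linear projection from text space to visual space."""

    def __init__(self, text_dim, visual_dim, seed=0):
        rng = random.Random(seed)
        # The only trainable parameters in this sketch: a weight matrix and a bias.
        self.weight = [[rng.gauss(0.0, 0.02) for _ in range(text_dim)]
                       for _ in range(visual_dim)]
        self.bias = [0.0] * visual_dim
        self.enabled = True  # set to False for visual-only tracking

    def num_trainable_params(self):
        return len(self.weight) * len(self.weight[0]) + len(self.bias)

    def project(self, text_embedding):
        # Map the text embedding into the visual feature space.
        return [sum(w * t for w, t in zip(row, text_embedding)) + b
                for row, b in zip(self.weight, self.bias)]

    def fuse(self, visual_feature, text_embedding):
        # Additive fusion is an assumption; the abstract does not specify the operator.
        if not self.enabled or text_embedding is None:
            return list(visual_feature)
        projected = self.project(text_embedding)
        return [v + p for v, p in zip(visual_feature, projected)]


# Stand-ins for the outputs of the frozen encoders (made-up values).
visual_feature = [0.5, -1.0, 0.25, 2.0]  # visual_dim = 4
text_embedding = [1.0, 0.0, -1.0]        # text_dim = 3

adapter = TextAdapter(text_dim=3, visual_dim=4)
fused = adapter.fuse(visual_feature, text_embedding)

adapter.enabled = False
visual_only = adapter.fuse(visual_feature, text_embedding)
```

With the adapter disabled, `fuse` returns the visual feature unchanged, which is what makes the module optional at inference. In a real tracker the projection would be trained by gradient descent while both encoders keep their pretrained weights.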

Cite This Study

Borsuk et al. (2025) studied this question.

synapsesocial.com/papers/6a1278d2bb918b6e5b6769fc https://doi.org/10.3390/bdcc9110292
