What question did this study set out to answer?

This research aims to develop a method for multi-object 3D tracking using monocular video with language descriptions.

February 13, 2026

Monocular Multi-object 3D Visual Language Tracking

Key Points

Defined the MoMo-3DVLT task and framework for multi-object 3D tracking in monocular video.
Created the MoMo-3DRoVLT dataset with 8,216 video sequences annotated with 2D and 3D bounding boxes.
Designed the MoMo-3DVLTracker neural model, which combines a multimodal feature extractor, a visual language encoder-decoder, and detection and tracking modules.
The proposed method outperforms existing multi-object tracking methods on the MoMo-3DRoVLT dataset.
Demonstrated improved localization of multiple objects using language descriptions.

Abstract

Visual Language Tracking (VLT) enables machines to perform tracking in the real world through human-like language descriptions. However, existing VLT methods are limited to 2D spatial tracking or single-object 3D tracking and do not support multi-object 3D tracking within monocular video. This limitation arises because advancements in 3D multi-object tracking have predominantly relied on sensor-based data (e.g., point clouds, depth sensors) that lacks corresponding language descriptions. Moreover, natural language descriptions in existing VLT literature often suffer from redundancy, impeding the efficient and precise localization of multiple objects. We present the first technique to extend VLT to multi-object 3D tracking using monocular video. We introduce a comprehensive framework that includes (i) a Monocular Multi-object 3D Visual Language Tracking (MoMo-3DVLT) task, (ii) a large-scale dataset, MoMo-3DRoVLT, tailored for this task, and (iii) a custom neural model. Our dataset, generated with the aid of Large Language Models (LLMs) and manual verification, contains 8,216 video sequences annotated with both 2D and 3D bounding boxes, with each sequence accompanied by three freely generated, human-level textual descriptions. We propose MoMo-3DVLTracker, the first neural model specifically designed for MoMo-3DVLT. This model integrates a multimodal feature extractor, a visual language encoder-decoder, and modules for detection and tracking, setting a strong baseline for MoMo-3DVLT. Beyond existing paradigms, it introduces a task-specific structural coupling that integrates a differentiable linked-memory mechanism with depth-guided and language-conditioned reasoning for robust monocular 3D multi-object tracking. Experimental results demonstrate that our approach outperforms existing methods on the MoMo-3DRoVLT dataset. Our dataset and code are available on GitHub.
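To make the dataset description concrete, the following is a minimal sketch of how one annotated sequence could be represented in memory. It is not the released MoMo-3DRoVLT format: the class names, field names, box layouts, and track ids are all assumptions chosen for illustration. Only the overall shape comes from the abstract (per-sequence 2D and 3D bounding boxes for multiple objects, plus three textual descriptions).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Assumed box layouts; the actual dataset may use different conventions.
Box2D = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
Box3D = Tuple[float, float, float, float, float, float, float]  # (x, y, z, w, h, l, yaw)


@dataclass
class FrameAnnotation:
    """Boxes for one video frame, keyed by a hypothetical integer track id."""
    boxes_2d: Dict[int, Box2D] = field(default_factory=dict)
    boxes_3d: Dict[int, Box3D] = field(default_factory=dict)


@dataclass
class SequenceRecord:
    """One video sequence with its language descriptions (hypothetical schema)."""
    sequence_id: str
    descriptions: List[str]
    frames: List[FrameAnnotation]

    def __post_init__(self) -> None:
        # The abstract states that each sequence has three descriptions.
        if len(self.descriptions) != 3:
            raise ValueError("expected three descriptions per sequence")

    def track_ids(self) -> List[int]:
        """Return the sorted ids of all objects that have a 3D box in any frame."""
        ids = set()
        for frame in self.frames:
            ids.update(frame.boxes_3d)
        return sorted(ids)


# Illustrative values only; none of these numbers come from the dataset.
record = SequenceRecord(
    sequence_id="seq_0001",
    descriptions=[
        "the red car ahead on the left",
        "a pedestrian crossing from the right",
        "the cyclist behind the red car",
    ],
    frames=[
        FrameAnnotation(
            boxes_2d={1: (10.0, 20.0, 110.0, 90.0)},
            boxes_3d={1: (2.0, 0.5, 15.0, 1.8, 1.5, 4.2, 0.1)},
        ),
        FrameAnnotation(
            boxes_2d={1: (12.0, 21.0, 112.0, 91.0), 2: (300.0, 40.0, 340.0, 140.0)},
            boxes_3d={
                1: (2.1, 0.5, 14.6, 1.8, 1.5, 4.2, 0.1),
                2: (-3.0, 0.4, 9.0, 0.6, 1.7, 0.6, 1.5),
            },
        ),
    ],
)
print(record.track_ids())
```

Keying boxes by track id in each frame keeps an object's identity explicit across frames, which is the property a multi-object tracker has to recover.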


Authors

Hao Wei

Chinese Academy of Sciences

Rong Wang

Haixiang Hu

Journals

IEEE Transactions on Image Processing

Institutions

The University of Melbourne

Xidian University

Chang'an University
