Recently, the challenging task of Open-Vocabulary Video Instance Segmentation (OVVIS) has been proposed. The OVVIS task requires simultaneously classifying, segmenting, and tracking objects in videos from an open set of categories, including novel categories unseen during training. Previous approaches typically rely on universal object proposals, memory-induced tracking, and open-vocabulary classification, which are often incompatible with established VIS and open-vocabulary segmentation methods. Observing that recent VIS methods share a common architecture decomposed into a segmenter and a tracker, we design a simple yet effective Switchable Open-vocabulary VIS (SOV) framework. SOV consists of an Open-Vocabulary Segmenter and a Dual Memory Tracker. The segmenter incorporates a frozen CLIP vision encoder as the backbone to enhance generalization on novel categories. The Dual Memory Tracker is training-free and utilizes a dual-memory mechanism to enhance tracking robustness. Moreover, we can easily switch to various trackers. Benefiting from this design, SOV can inherit advantages from state-of-the-art VIS methods. To further optimize training efficiency, we propose a progressive “Long-Image, Short-Video” training pipeline. This strategy decouples the training process into an extensive image-level pre-training phase followed by a rapid video-level adaptation phase, significantly accelerating convergence while effectively bridging the domain gap between static images and dynamic videos. Our method outperforms previous methods by large margins on various benchmarks while maintaining faster inference speeds. Specifically, SOV achieves 38.0 mAP on the LV-VIS validation set. It also achieves strong zero-shot performance on popular VIS datasets (YTVIS19 50.9 mAP, YTVIS21 45.2 mAP, OVIS 23.1 mAP), comparable to fully supervised methods.
To further validate the flexibility of our switchable architecture, we extend SOV with the state-of-the-art CTVIS tracker, which yields improved performance (51.3 mAP) on YTVIS19. Code is available in the supplementary material.
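The segmenter/tracker split described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say what the two memories hold or how instances are matched. The sketch assumes a short-term memory (each track's most recent embedding), a long-term memory (an exponential moving average per track), cosine similarity, and greedy matching. The names `Tracker`, `DualMemoryTracker`, `run_video`, `momentum`, and `threshold` are all hypothetical, and the segmenter is stood in for by any callable that returns one embedding per detected instance.

```python
import math
from typing import Callable, Dict, List, Protocol, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


class Tracker(Protocol):
    """Hypothetical interface: any tracker exposing update() can be swapped in."""

    def update(self, embeddings: List[Vector]) -> List[int]: ...


class DualMemoryTracker:
    """Training-free tracker sketch (assumed design, not the paper's).

    short: latest embedding per track (assumed short-term memory)
    long:  moving average of embeddings per track (assumed long-term memory)
    """

    def __init__(self, momentum: float = 0.9, threshold: float = 0.5) -> None:
        self.momentum = momentum
        self.threshold = threshold
        self.short: Dict[int, List[float]] = {}
        self.long: Dict[int, List[float]] = {}
        self._next_id = 0

    def update(self, embeddings: List[Vector]) -> List[int]:
        # Score every (detection, existing track) pair against both memories.
        pairs = []
        for d, emb in enumerate(embeddings):
            for tid in self.short:
                score = 0.5 * cosine(emb, self.short[tid]) \
                    + 0.5 * cosine(emb, self.long[tid])
                pairs.append((score, d, tid))
        pairs.sort(reverse=True)

        # Greedy one-to-one matching, best score first.
        matched: Dict[int, int] = {}
        used = set()
        for score, d, tid in pairs:
            if score < self.threshold:
                break
            if d in matched or tid in used:
                continue
            matched[d] = tid
            used.add(tid)

        ids = []
        for d, emb in enumerate(embeddings):
            tid = matched.get(d)
            if tid is None:
                # Unmatched detection starts a new track.
                tid = self._next_id
                self._next_id += 1
                self.long[tid] = list(emb)
            else:
                m = self.momentum
                self.long[tid] = [m * l + (1 - m) * e
                                  for l, e in zip(self.long[tid], emb)]
            self.short[tid] = list(emb)
            ids.append(tid)
        return ids


def run_video(segmenter: Callable[[object], List[Vector]],
              tracker: Tracker, frames: Sequence[object]) -> List[List[int]]:
    """Per-frame segmentation followed by tracking; the tracker is switchable."""
    return [tracker.update(segmenter(frame)) for frame in frames]
```

Because `run_video` depends only on the `update` method, a different tracker could be passed in without touching the segmenter, which is the kind of switching the abstract describes.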
Feng Zhu
Ling Chen
Yunchao Wei
ACM Transactions on Multimedia Computing Communications and Applications
University of Technology Sydney
Beijing Jiaotong University
www.synapsesocial.com/papers/69e866ad6e0dea528ddeb043 — DOI: https://doi.org/10.1145/3803013