What question did this study set out to answer?

This research aims to develop an effective framework for Open-Vocabulary Video Instance Segmentation (OVVIS) that enhances tracking and classification of objects in videos.

April 22, 2026

A Simple Switchable Framework for Open-Vocabulary Video Instance Segmentation

Key points

Designed a Switchable Open-vocabulary VIS (SOV) framework with an Open-Vocabulary Segmenter and a Dual Memory Tracker.
Incorporated a frozen CLIP vision encoder for better generalization on novel categories.
Proposed a progressive training pipeline for efficient adaptation from static images to dynamic videos.
SOV achieved 38.0 mAP on the LV-VIS validation set, outperforming previous methods.
Demonstrated strong zero-shot performance on YTVIS19 (50.9 mAP), YTVIS21 (45.2 mAP), and OVIS (23.1 mAP), comparable to fully-supervised methods.
Extending SOV with the CTVIS tracker improved performance to 51.3 mAP on YTVIS19.

Abstract

Recently, the challenging task of Open-Vocabulary Video Instance Segmentation (OVVIS) has been proposed. The OVVIS task requires simultaneously classifying, segmenting, and tracking objects in videos from an open set of categories, including novel categories unseen during training. Previous approaches typically rely on universal object proposals, memory-induced tracking, and open-vocabulary classification, which are often incompatible with established VIS and open-vocabulary segmentation methods. Observing that recent VIS methods share a common architecture decomposed into a segmenter and a tracker, we design a simple yet effective Switchable Open-vocabulary VIS (SOV) framework. SOV consists of an Open-Vocabulary Segmenter and a Dual Memory Tracker. The segmenter incorporates a frozen CLIP vision encoder as the backbone to enhance generalization on novel categories. The Dual Memory Tracker is training-free and utilizes a dual-memory mechanism to enhance tracking robustness. Moreover, we can easily switch to various trackers. Benefiting from this design, SOV can inherit advantages from state-of-the-art VIS methods. To further optimize training efficiency, we propose a progressive "Long-Image, Short-Video" training pipeline. This strategy decouples the training process into an extensive image-level pre-training phase followed by a rapid video-level adaptation phase, significantly accelerating convergence while effectively bridging the domain gap between static images and dynamic videos. Our method outperforms previous methods by large margins on various benchmarks while maintaining faster inference speeds. Specifically, SOV achieves 38.0 mAP on the LV-VIS validation set. It also achieves strong zero-shot performance on popular VIS datasets (YTVIS19 50.9 mAP, YTVIS21 45.2 mAP, OVIS 23.1 mAP), comparable to fully-supervised methods.
To further validate the flexibility of our switchable architecture, we extend SOV with the state-of-the-art CTVIS tracker, which yields improved performance (51.3 mAP) on YTVIS19. Code is available in the supplementary material.
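
The segmenter/tracker decomposition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Detection`, `Tracker`, `GreedyCosineTracker`, and `run_video` are hypothetical, and the greedy cosine-similarity tracker is a simple stand-in, not the Dual Memory Tracker or CTVIS. It only shows how a shared interface would let one tracker be swapped for another without touching the segmenter.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Callable, Dict, List, Protocol, Sequence, Tuple


@dataclass
class Detection:
    """Per-frame segmenter output (hypothetical structure)."""
    embedding: Sequence[float]  # instance embedding used for association
    label: str                  # open-vocabulary class name


class Tracker(Protocol):
    """Any object with this method can be plugged into the pipeline."""
    def update(self, detections: List[Detection]) -> List[int]: ...


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class GreedyCosineTracker:
    """Stand-in tracker (an assumption for illustration): greedily matches
    each detection to the most similar stored track embedding."""

    def __init__(self, threshold: float = 0.5) -> None:
        self.threshold = threshold
        self.memory: Dict[int, Sequence[float]] = {}  # track id -> last embedding
        self.next_id = 0

    def update(self, detections: List[Detection]) -> List[int]:
        ids: List[int] = []
        used = set()  # track ids already assigned in this frame
        for det in detections:
            best_id, best_sim = None, self.threshold
            for tid, emb in self.memory.items():
                if tid in used:
                    continue
                sim = cosine(det.embedding, emb)
                if sim > best_sim:
                    best_id, best_sim = tid, sim
            if best_id is None:  # nothing similar enough: start a new track
                best_id = self.next_id
                self.next_id += 1
            used.add(best_id)
            self.memory[best_id] = det.embedding
            ids.append(best_id)
        return ids


def run_video(
    frames: Sequence[object],
    segmenter: Callable[[object], List[Detection]],
    tracker: Tracker,
) -> List[List[Tuple[int, str]]]:
    """Segment each frame, then let the tracker assign instance ids."""
    results = []
    for frame in frames:
        detections = segmenter(frame)
        ids = tracker.update(detections)
        results.append([(i, d.label) for i, d in zip(ids, detections)])
    return results
```

Under these assumptions, switching trackers amounts to passing a different object that implements `update` to `run_video`; the segmenter callable is left unchanged.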
