What question did this study set out to answer?

The study aims to improve video instance segmentation by leveraging causal prompts from previous frames.

March 6, 2026

Causal Prompts for Open-vocabulary Video Instance Segmentation

Key Points

Developed CPOVIS, a framework built on a Mask2Former architecture with a CLIP backbone.
Introduced PromptCLIP, which aligns cross-modal embeddings while preserving open-vocabulary capabilities.
Implemented a Visual Prompt Injector that propagates object-level features to maintain spatial-temporal coherence.
Used a Taxonomy Prompt Infuser that leverages hierarchical semantic relationships to stabilize recognition of unseen categories.
Adopted a contrastive learning strategy to disentangle object representations across frames.
Evaluated CPOVIS on seven open- and closed-vocabulary video segmentation benchmarks.
Reported state-of-the-art results, outperforming existing methods by significant margins on both open- and closed-vocabulary tasks.
Demonstrated improved temporal reasoning and semantic consistency in video understanding.

Abstract

Open-vocabulary video instance segmentation addresses the challenging task of detecting, segmenting, and tracking objects in videos, including categories not encountered during training. However, existing approaches often overlook rich temporal cues from preceding frames, limiting their ability to leverage causal context for robust open-world generalization. To bridge this gap, we propose CPOVIS, a novel framework that introduces causal prompts (dynamically propagated visual and taxonomy prompts from historical frames) to enhance temporal reasoning and semantic consistency. Built upon a Mask2Former architecture with a CLIP backbone, CPOVIS integrates three core innovations: (1) PromptCLIP, which aligns cross-modal embeddings while preserving open-vocabulary capabilities; (2) a Visual Prompt Injector that propagates object-level features to maintain spatial-temporal coherence; and (3) a Taxonomy Prompt Infuser that leverages hierarchical semantic relationships to stabilize unseen category recognition. Furthermore, we introduce a contrastive learning strategy to disentangle object representations across frames and adapt the Segment Anything Model (SAM2) to boost segmentation and tracking capacity in open-vocabulary video scenarios. Extensive experiments on seven challenging open- and closed-vocabulary video segmentation benchmarks demonstrate CPOVIS's state-of-the-art performance, outperforming existing methods by significant margins. Our findings highlight the critical role of causal prompt propagation in advancing video understanding in open-world scenarios.
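
The abstract mentions a contrastive learning strategy that disentangles object representations across frames but does not give its form. As a rough illustration only, the sketch below uses a generic InfoNCE-style loss between instance embeddings of two frames; this is a swapped-in standard technique, not the loss defined in the paper. The function name `cross_frame_info_nce`, the `temperature` value, and the assumption that row i of both arrays describes the same object are all hypothetical.

```python
import numpy as np


def cross_frame_info_nce(prev_emb, curr_emb, temperature=0.1):
    """Generic InfoNCE loss between instance embeddings of two frames.

    prev_emb, curr_emb: arrays of shape (N, D). Row i of both arrays is
    ASSUMED to describe the same object instance (hypothetical setup,
    not taken from the paper).
    """
    # L2-normalise so that the dot product is a cosine similarity.
    prev = prev_emb / np.linalg.norm(prev_emb, axis=1, keepdims=True)
    curr = curr_emb / np.linalg.norm(curr_emb, axis=1, keepdims=True)

    # Similarity of every current instance to every previous instance.
    logits = curr @ prev.T / temperature  # shape (N, N)

    # Numerically stable log-softmax over the previous-frame candidates.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Positives sit on the diagonal: the same instance in both frames.
    return float(-np.mean(np.diag(log_prob)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prev = rng.normal(size=(4, 8))
    curr = prev + 0.01 * rng.normal(size=(4, 8))  # same objects, slight drift
    print("matched identities:", cross_frame_info_nce(prev, curr))
    print("swapped identities:", cross_frame_info_nce(prev, curr[::-1].copy()))
```

Under these assumptions the loss is low when each object's embedding stays closest to its own embedding from the previous frame, and high when identities are swapped, which is the behaviour a cross-frame disentangling objective would reward.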

Cite This Study

Zheng et al. studied this question.

synapsesocial.com/papers/69aa7008531e4c4a9ff59679 https://doi.org/10.1109/tpami.2026.3669976
