Open-vocabulary Video Instance Segmentation addresses the challenging task of detecting, segmenting, and tracking objects in videos, including categories not encountered during training. However, existing approaches often overlook rich temporal cues from preceding frames, limiting their ability to leverage causal context for robust open-world generalization. To bridge this gap, we propose CPOVIS, a novel framework that introduces causal prompts (visual and taxonomy prompts dynamically propagated from historical frames) to enhance temporal reasoning and semantic consistency. Built upon a Mask2Former architecture with a CLIP backbone, CPOVIS integrates three core innovations: (1) PromptCLIP, which aligns cross-modal embeddings while preserving open-vocabulary capabilities; (2) a Visual Prompt Injector that propagates object-level features to maintain spatio-temporal coherence; and (3) a Taxonomy Prompt Infuser that leverages hierarchical semantic relationships to stabilize recognition of unseen categories. Furthermore, we introduce a contrastive learning strategy to disentangle object representations across frames, and we adapt the Segment Anything Model 2 (SAM2) to strengthen segmentation and tracking in open-vocabulary video scenarios. Extensive experiments on seven challenging open- and closed-vocabulary video segmentation benchmarks demonstrate CPOVIS's state-of-the-art performance, outperforming existing methods by significant margins. Our findings highlight the critical role of causal prompt propagation in advancing video understanding in open-world scenarios.
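The abstract does not specify the form of the contrastive objective used to disentangle object representations across frames. As a rough illustration only, the sketch below uses a generic InfoNCE-style loss, which is an assumption and not necessarily the loss used in CPOVIS. The function name `cross_frame_info_nce`, the `temperature` value, and the convention that index `i` denotes the same object in both frames are all hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_frame_info_nce(prev_embeds, curr_embeds, temperature=0.1):
    """Generic InfoNCE loss between object embeddings of two frames.

    Assumption (not from the paper): prev_embeds[i] and curr_embeds[i]
    describe the same object, and every other pairing is a negative.
    """
    prev = [l2_normalize(v) for v in prev_embeds]
    curr = [l2_normalize(v) for v in curr_embeds]
    total = 0.0
    for i, anchor in enumerate(curr):
        # Similarity of the current-frame object to every previous-frame object.
        logits = [dot(anchor, p) / temperature for p in prev]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching object as the positive class.
        total += log_denom - logits[i]
    return total / len(curr)


# Toy example: two objects whose embeddings stay close across frames.
prev_frame = [[1.0, 0.0], [0.0, 1.0]]
curr_matched = [[0.9, 0.1], [0.1, 0.9]]
curr_swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_matched = cross_frame_info_nce(prev_frame, curr_matched)
loss_swapped = cross_frame_info_nce(prev_frame, curr_swapped)
```

In this toy setup the loss is lower when each object's embedding stays close to its own embedding in the previous frame than when identities are swapped, which is the behavior a cross-frame contrastive term is meant to encourage.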
Zheng et al. (Thu,) studied this question.