This work aims to learn video object segmentation (VOS), i.e., mask propagation, in a self-supervised manner. We develop a unified framework which simultaneously models cross-frame dense correspondence for locally discriminative feature learning and embeds object-level context for target-mask decoding. As a result, it is able to directly learn to perform mask-guided sequential segmentation from unlabeled videos, in contrast to previous efforts, which usually rely on an oblique solution: cheaply "copying" labels according to pixel-wise correlations. Concretely, our algorithm alternates between i) clustering video pixels to create pseudo segmentation labels ex nihilo; and ii) utilizing the pseudo labels to learn mask encoding and decoding for VOS. Unsupervised correspondence learning is further incorporated into this self-taught mask embedding scheme, so as to ensure the generic nature of the learnt representation and avoid cluster degeneracy. Building on the above design, we further lift the offline clustering process to an online version, which streamlines the integration of mask embedding and dense correspondence modeling and results in a more cohesive online learning framework. This enables more effective generation of pseudo labels without compromising training speed. Additionally, we introduce a semantic centroids pool, a repository of content-aware visual representations across the entire video data. Managed through multi-frame clustering centroids, it increases the precision and reliability of the pseudo segmentation labels. Our improved algorithm sets a new state of the art on three standard benchmarks (i.e., DAVIS₁₇, YouTube-VOS, and VIP), narrowing the gap between self- and fully-supervised VOS. Code is available at Mask-VOS.
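As a rough illustration of step i) above, the sketch below clusters per-pixel feature vectors into pseudo segmentation labels. It is not the authors' implementation: plain k-means in NumPy stands in for whatever clustering the paper uses, and the function name `kmeans_pseudo_labels`, its parameters, and the `(N, D)` feature layout are all assumptions made for this example.

```python
import numpy as np


def kmeans_pseudo_labels(features, num_clusters, num_iters=10, seed=0):
    """Cluster pixel embeddings into pseudo labels (hypothetical helper).

    features: (N, D) array with one row per pixel embedding.
    Returns (labels, centroids) with shapes (N,) and (num_clusters, D).
    """
    features = np.asarray(features, dtype=np.float64)
    rng = np.random.default_rng(seed)

    # Initialise centroids from randomly chosen, distinct pixels.
    init = rng.choice(features.shape[0], size=num_clusters, replace=False)
    centroids = features[init].copy()

    def assign(cents):
        # Squared Euclidean distance from every pixel to every centroid.
        diffs = features[:, None, :] - cents[None, :, :]
        return (diffs ** 2).sum(axis=-1).argmin(axis=1)

    for _ in range(num_iters):
        labels = assign(centroids)
        for k in range(num_clusters):
            members = features[labels == k]
            # Keep the previous centroid if a cluster becomes empty.
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)

    # Final assignment against the updated centroids.
    return assign(centroids), centroids
```

In an alternating scheme of the kind the abstract describes, the returned `labels` would be reshaped to the frame resolution and used as targets for the mask encoder and decoder in step ii); the `centroids` are the sort of quantity a centroid pool could store, though how the paper manages its pool is not specified here.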
Ruijie Quan
Zhejiang University
Liulei Li
Zhejiang University
Zongxin Yang
Zhejiang University
IEEE Transactions on Pattern Analysis and Machine Intelligence
Zhejiang University
Quan et al. studied this question.
synapsesocial.com/papers/6a095a877880e6d24efe0843 — DOI: https://doi.org/10.1109/tpami.2026.3692914