What question did this study set out to answer?

The aim is to develop a self-supervised framework for video object segmentation that efficiently learns from unlabeled videos.

May 17, 2026

Mask-Guided Self-Supervised Video Object Segmentation

Ruijie Quan (Zhejiang University), Liulei Li (Zhejiang University), Zongxin Yang (Zhejiang University)

Key Points

The aim is to develop a self-supervised framework for video object segmentation that efficiently learns from unlabeled videos.
Created a unified framework for cross-frame dense correspondence and object-level context.
Alternated between clustering video pixels for pseudo segmentation and learning mask encoding/decoding.
Incorporated online clustering to enhance the precision of pseudo labels and streamline training.
Achieved state-of-the-art performance on the DAVIS₁₇, YouTube-VOS, and VIP benchmarks.
Narrowed the gap between self-supervised and fully supervised video object segmentation.
Generated pseudo labels more effectively with the online design, without compromising training speed.
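
The clustering step named in the key points can be illustrated with a minimal sketch. It assumes per-pixel features are already available as a NumPy array and uses plain k-means as a stand-in; the function name `kmeans_pseudo_labels`, the array layout, and the choice of k-means are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def kmeans_pseudo_labels(features, k, iters=10, seed=0):
    """Cluster per-pixel features of shape (T, H, W, C) into k pseudo segments.

    Hypothetical helper: plain k-means stands in for the clustering step,
    which the paper may implement differently.
    """
    T, H, W, C = features.shape
    x = features.reshape(-1, C).astype(np.float64)
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct pixels.
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every pixel to every centroid.
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = dist.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members):  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(axis=0)
    # One pseudo-label map per frame, plus the cluster centroids.
    return assign.reshape(T, H, W), centroids
```

In a pipeline of this kind, the returned label maps would serve as training targets for the mask encoder and decoder, and the centroids could be kept across videos in the spirit of a centroid pool.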

Abstract

This work aims to learn video object segmentation (VOS), i.e., mask propagation, in a self-supervised manner. We develop a unified framework that simultaneously models cross-frame dense correspondence for locally discriminative feature learning and embeds object-level context for target-mask decoding. As a result, it can directly learn to perform mask-guided sequential segmentation from unlabeled videos, in contrast to previous efforts that usually rely on an oblique solution: cheaply "copying" labels according to pixel-wise correlations. Concretely, our algorithm alternates between i) clustering video pixels to create pseudo segmentation labels ex nihilo; and ii) using the pseudo labels to learn mask encoding and decoding for VOS. Unsupervised correspondence learning is further incorporated into this self-taught mask embedding scheme, so as to ensure the generic nature of the learnt representation and avoid cluster degeneracy. Building on this design, we further lift the offline clustering process to an online version, which streamlines the integration of mask embedding and dense correspondence modeling and results in a more cohesive online learning framework. Consequently, pseudo labels are generated more effectively without compromising training speed. Additionally, we introduce a semantic centroids pool, a repository of content-aware visual representations across the entire video data. Managed through multi-frame clustering centroids, it increases the precision and reliability of the pseudo segmentation labels. Our improved algorithm sets a new state of the art on three standard benchmarks (i.e., DAVIS₁₇, YouTube-VOS, and VIP), narrowing the gap between self- and fully-supervised VOS. Code is available at Mask-VOS.
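
For context, the label "copying" that the abstract contrasts against can be sketched as affinity-weighted propagation: each query pixel takes a softmax-weighted average of the reference-frame labels, with weights given by feature similarity. This is a generic illustration of that family of baselines, not code from the paper; `propagate_labels`, the cosine similarity, and the `temperature` default are assumptions.

```python
import numpy as np

def propagate_labels(ref_feats, ref_labels, qry_feats, temperature=0.07):
    """Copy labels from reference pixels to query pixels by feature affinity.

    ref_feats:  (N, C) features of the N reference pixels
    ref_labels: (N, K) one-hot or soft labels of the reference pixels
    qry_feats:  (M, C) features of the M query pixels
    Returns an (M, K) array of soft labels for the query pixels.
    """
    eps = 1e-8  # guards against division by zero for all-zero features
    ref = ref_feats / (np.linalg.norm(ref_feats, axis=1, keepdims=True) + eps)
    qry = qry_feats / (np.linalg.norm(qry_feats, axis=1, keepdims=True) + eps)
    # Cosine similarity between every query and reference pixel.
    logits = (qry @ ref.T) / temperature
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    affinity = np.exp(logits)
    affinity = affinity / affinity.sum(axis=1, keepdims=True)
    # Each query label is an affinity-weighted average of reference labels.
    return affinity @ ref_labels
```

Because such a scheme only reuses existing labels through pixel-wise correlations, it has no learned notion of the object mask itself, which is the limitation the mask-guided framework is meant to address.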

Cite This Study

Quan et al. (Thu,) studied this question.

synapsesocial.com/papers/6a095a877880e6d24efe0843 https://doi.org/10.1109/tpami.2026.3692914
