What type of study is this?

This is a quantitative study.

September 18, 2025

LVOS: A Benchmark for Large-Scale Long-Term Video Object Segmentation

Lingyi Hong (City University of Hong Kong), Liu Zhong-ying (China Agricultural University), Wenchao Chen (Fudan University)

Key Points

On LVOS, existing VOS models face a significant performance drop due to challenges in long-term scenarios.
The benchmark includes 720 videos with over 407,000 high-quality annotations, facilitating robust evaluation.
Long video lengths contribute to accuracy declines, alongside complex factors like reappearance and occlusion.
LVOS aims to advance VOS development in realistic settings, addressing limitations of previous short-term benchmarks.

Abstract

Video object segmentation (VOS) aims to distinguish and track target objects in a video. Despite the excellent performance achieved by off-the-shelf VOS models, some existing VOS benchmarks focus mainly on short-term videos, where objects remain visible most of the time. These benchmarks may not fully capture the challenges encountered in practical applications, and the absence of long-term datasets restricts further investigation of VOS in realistic scenarios. We therefore propose a novel benchmark named LVOS, comprising 720 videos with 296,401 frames and 407,945 high-quality annotations. Videos in LVOS last 1.14 minutes on average. Each video includes various attributes, especially challenges encountered in the wild, such as long-term reappearing and cross-temporal similar objects. Compared to previous benchmarks, LVOS better reflects VOS models' performance in real scenarios. Based on LVOS, we evaluate 15 existing VOS models under 3 different settings and conduct a comprehensive analysis. On LVOS, these models suffer a large performance drop, highlighting the challenge of achieving precise tracking and segmentation in real-world scenarios. Attribute-based analysis indicates that one of the significant factors contributing to the accuracy decline is the increased video length, which interacts with complex challenges such as long-term reappearance, cross-temporal confusion, and occlusion. This emphasizes LVOS's crucial role. We hope LVOS can advance the development of VOS in real scenes.
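To put the dataset figures quoted in the abstract in proportion, the short sketch below derives a few per-video and per-frame averages from them. Only the four input numbers come from the abstract. The function name `dataset_summary` and the dictionary keys are illustrative, and the implied frame rate is an estimate that assumes the 296,401 counted frames are spread evenly over the 1.14-minute mean duration. The abstract does not state a frame rate, so treat that value as a back-of-the-envelope figure only.

```python
# Figures quoted in the abstract.
NUM_VIDEOS = 720
NUM_FRAMES = 296_401
NUM_ANNOTATIONS = 407_945
MEAN_DURATION_MIN = 1.14


def dataset_summary(videos, frames, annotations, mean_minutes):
    """Derive simple averages from the headline dataset statistics."""
    frames_per_video = frames / videos
    annotations_per_frame = annotations / frames
    # Assumption (not stated in the abstract): the counted frames are
    # spread evenly over the mean video duration.
    implied_fps = frames_per_video / (mean_minutes * 60)
    return {
        "frames_per_video": frames_per_video,
        "annotations_per_frame": annotations_per_frame,
        "implied_fps": implied_fps,
    }


summary = dataset_summary(
    NUM_VIDEOS, NUM_FRAMES, NUM_ANNOTATIONS, MEAN_DURATION_MIN
)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions, an average video contributes roughly 412 counted frames, and each frame carries a little more than one annotation on average, which would be consistent with some frames containing more than one target object.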
