Foundation models, exemplified by the Segment Anything Model (SAM), have revolutionized object segmentation with their impressive zero-shot capabilities. The recent SAM2 extends these abilities to the video domain, using an object pointer and memory attention to keep segments temporally consistent. However, a critical limitation of SAM2 is its vulnerability to error accumulation: a single incorrect mask can propagate through subsequent frames and cause tracking failure. To address this, we propose a method that actively monitors the temporal consistency of masks by measuring the distance between object pointers across frames. When a sharp increase in this distance signals a potential error, our method triggers a particle-filter-based re-inference module. This module models the object’s motion to predict a corrected bounding box, which guides the model to recover a valid mask and prevents error propagation. Extensive zero-shot evaluations on DAVIS, LVOS v2, and YouTube-VOS, together with qualitative results, show that the proposed parameter-free procedure consistently improves temporal coherence. It raises mean IoU by 0.1 on DAVIS, by 0.13 on the LVOS v2 train split, by 0.05 on the LVOS v2 validation split, and by 0.02 on YouTube-VOS, offering a simple and effective route to more robust video object segmentation with SAM2.
Lee et al. (Wed,) studied this question.
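The monitoring and recovery loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the distance metric, the spike criterion, or the motion model, so the cosine distance, the ratio-over-recent-mean rule (`ratio`, `floor`), the constant-velocity particle filter, and all function and class names below are assumptions chosen for illustration. SAM2 itself is not called; object pointers are stood in for by plain lists of floats.

```python
import math
import random


def cosine_distance(a, b):
    """Cosine distance between two pointer vectors (0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def is_drift(prev_ptr, cur_ptr, history, ratio=3.0, floor=1e-3):
    """Return (distance, flagged); flag a sharp rise over the recent mean.

    `ratio` and `floor` are hypothetical tuning values, not from the paper.
    """
    dist = cosine_distance(prev_ptr, cur_ptr)
    if not history:
        return dist, False
    baseline = max(sum(history) / len(history), floor)
    return dist, dist > ratio * baseline


class BoxCenterParticleFilter:
    """Constant-velocity particle filter over a box center (assumed model)."""

    def __init__(self, cx, cy, n=200, pos_std=2.0, vel_std=1.0,
                 meas_std=5.0, seed=0):
        self.rng = random.Random(seed)
        self.pos_std, self.vel_std, self.meas_std = pos_std, vel_std, meas_std
        # Each particle is [cx, cy, vx, vy].
        self.particles = [[cx, cy, 0.0, 0.0] for _ in range(n)]

    def predict(self):
        """Advance every particle one frame with random motion noise."""
        for p in self.particles:
            p[2] += self.rng.gauss(0.0, self.vel_std)
            p[3] += self.rng.gauss(0.0, self.vel_std)
            p[0] += p[2] + self.rng.gauss(0.0, self.pos_std)
            p[1] += p[3] + self.rng.gauss(0.0, self.pos_std)

    def update(self, cx, cy):
        """Weight particles by a Gaussian likelihood of the trusted center, then resample."""
        two_var = 2.0 * self.meas_std ** 2
        weights = [
            math.exp(-((p[0] - cx) ** 2 + (p[1] - cy) ** 2) / two_var) + 1e-300
            for p in self.particles
        ]
        chosen = self.rng.choices(self.particles, weights=weights,
                                  k=len(self.particles))
        self.particles = [list(p) for p in chosen]

    def estimate(self):
        """Mean particle position, used as the predicted box center."""
        n = len(self.particles)
        return (sum(p[0] for p in self.particles) / n,
                sum(p[1] for p in self.particles) / n)


def corrected_box(center, width, height):
    """Box (x0, y0, x1, y1) around a predicted center, reusing the last trusted size."""
    cx, cy = center
    return (cx - width / 2, cy - height / 2, cx + width / 2, cy + height / 2)
```

In a tracking loop under these assumptions, each frame would call `predict()`, then `is_drift(...)` on the current and previous pointers. On unflagged frames the filter is updated with the center of the mask's bounding box; on a flagged frame the update is skipped and `corrected_box(pf.estimate(), w, h)` supplies a box prompt for re-inference.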