Video inpainting modifies local regions of a video while preserving spatial and temporal coherence. However, existing methods, both traditional and recent diffusion-based ones, face two key limitations: they lack unified support for insertion and completion, and they are restricted to single-object inpainting, which makes it difficult to handle multi-object scenarios involving grounding and interaction. In this paper, we propose MultiPaint, a unified framework for multi-task, multi-object, and multi-condition video inpainting. First, we introduce dual-branch adapters that unify the insertion and completion tasks within a single model. Second, we propose a test-time scheduled feature composition strategy that enables multi-object inpainting at user-specified locations while better preserving interactions among objects, a setting that prior work has insufficiently addressed. Third, we introduce a multi-condition inpainting scheme that integrates text-guided, image-guided, and keyframe-guided modes via dynamic frame masking, providing greater controllability over appearance customization. Extensive experiments show that MultiPaint achieves state-of-the-art performance on object insertion and scene completion among recent methods. We further demonstrate its versatility in downstream tasks, including grounded video generation, object editing, object removal, image-guided inpainting, and long video inpainting.
Yang et al. (Thu,) studied this question.