What does this research mean for the field?

The MultiPaint framework achieves state-of-the-art performance in video inpainting by unifying object insertion and scene completion, and by enabling multi-object and multi-condition inpainting. Novelty: methodological. Consensus alignment: neutral.

What question did this study set out to answer?

This research aims to develop a unified framework for efficiently handling video inpainting tasks involving multiple objects and conditions.

May 16, 2026

MultiPaint: A Unified Framework for Multi-task, Multi-object, and Multi-condition Video Inpainting

Key Points

This research aims to develop a unified framework for efficiently handling video inpainting tasks involving multiple objects and conditions.
Introduced dual-branch adapters for integrating insertion and completion tasks into a single model.
Implemented a scheduled feature composition strategy enabling user-specified multi-object inpainting.
Developed a dynamic frame masking scheme for controlling inpainting modes such as text-guided and image-guided.
MultiPaint achieved state-of-the-art performance on both object insertion and scene completion tasks.
Demonstrated versatility in grounded video generation and object editing through extensive experiments.

Abstract

Video inpainting modifies local regions in video while ensuring spatial and temporal coherence. However, existing methods, both traditional and recent diffusion-based ones, face key limitations: they lack unified support for both insertion and completion, and they are restricted to single-object inpainting, making it difficult to handle multi-object scenarios involving grounding and interaction. In this paper, we propose MultiPaint, a unified framework for multi-task, multi-object, and multi-condition video inpainting. First, we introduce dual-branch adapters to unify the insertion and completion tasks within a single model. Second, we propose a test-time scheduled feature composition strategy that enables multi-object inpainting with user-specified locations while better preserving interactions among objects, a setting that has been insufficiently addressed in prior work. Third, we introduce a multi-condition inpainting scheme that integrates text-guided, image-guided, and keyframe-guided modes via dynamic frame masking, providing more controllability in appearance customization. Extensive experiments show that MultiPaint achieves state-of-the-art performance on object insertion and scene completion among recent methods. We further demonstrate its versatility in downstream tasks including grounded video generation, object editing, object removal, image-guided inpainting, and long video inpainting.
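The abstract does not spell out how dynamic frame masking selects a conditioning mode, so the following is only a rough sketch of one way per-frame masks could encode the three modes. The function name `build_frame_mask`, the mode strings, and every masking rule in it are assumptions made for illustration; none of them is taken from the MultiPaint implementation.

```python
import numpy as np


def build_frame_mask(num_frames, height, width, box, mode):
    """Return a binary mask of shape (frames, height, width); 1 marks pixels to inpaint.

    Hypothetical illustration: the mode names and the rules below are
    assumptions, not the scheme used by MultiPaint.
    """
    top, left, bottom, right = box
    mask = np.zeros((num_frames, height, width), dtype=np.uint8)
    # By default the target box is inpainted in every frame.
    mask[:, top:bottom, left:right] = 1

    if mode == "text":
        # Assumed rule: all frames stay masked, so appearance is driven
        # by the text prompt alone.
        return mask
    if mode == "keyframe":
        # Assumed rule: frame 0 is a user-edited keyframe and is left
        # fully visible so later frames can follow its appearance.
        mask[0] = 0
        return mask
    if mode == "image":
        # Assumed rule: an extra, fully visible slot is prepended to
        # hold the reference image.
        reference_slot = np.zeros((1, height, width), dtype=np.uint8)
        return np.concatenate([reference_slot, mask], axis=0)
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    for mode in ("text", "keyframe", "image"):
        m = build_frame_mask(8, 32, 32, (4, 4, 20, 20), mode)
        print(mode, m.shape, int(m.sum()))
```

Under these assumptions, switching modes changes only which frames are exposed to the model, which is one plausible reading of how a single model could serve text-guided, image-guided, and keyframe-guided inpainting.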
