With the advances of deep learning, video inpainting has made significant progress in recent years, giving rise to deep learning-based video inpainting, also known as deep video inpainting. These methods learn the underlying regularities and feature distributions of video datasets in a data-driven manner and use them to complete the missing regions of a video in a spatially and temporally coherent way. The original goal is to recover damaged or lost parts of videos, but the same techniques can be used maliciously to remove target objects. As a result, deep video inpainting poses potential threats to the country, society, and individuals, and the detection of inpainted video has attracted wide research interest in the field of information security. The primary objective of this article is to provide a comprehensive summary of deep video inpainting and the corresponding detection methods. Specifically, we classify existing deep video inpainting methods by the deep learning module they are built on: 3D convolution-based, optical flow-based, alignment-based, temporal shift-based, attention-based, and diffusion-based network models. From a network feature analysis perspective, we likewise sort existing research on deep video inpainting detection into four categories: spatial-domain, temporal-domain, frequency-domain, and hybrid-domain network models. In addition, we review their training objectives, loss functions, and common benchmark datasets. We present video-level and pixel-level evaluation metrics, conduct qualitative and quantitative evaluations, and discuss the advantages and disadvantages of representative deep video inpainting methods and their corresponding detection methods. Finally, we outline potential future research directions for deep video inpainting and its detection.
Yao et al. studied this question.