Video Frame Interpolation (VFI) is an essential video processing task that synthesizes intermediate frames between a given initial and final frame to increase temporal resolution. It is critical in applications such as frame rate up-sampling, slow-motion rendering, and video enhancement. This paper compares and evaluates the merits and limitations of several VFI methods in terms of their architectures and interpolation performance. It surveys conventional optical flow-based methods, kernel-based models, hybrid models based on depth estimation, flow-agnostic convolutional models, Transformer models, and recent generative diffusion models. In particular, it compares each method's architecture, motion-handling ability, and efficiency. Experimental evaluation shows that Transformer and diffusion models are superior at handling large and complex motions. By comparison, models such as Flow-Agnostic Video Representations (FLAVR) balance efficiency and accuracy, making them well suited to real-time processing. The results also indicate that VFI methods are shifting toward data-driven, globally aware architectures that better capture the richness of motion. These findings inform future research and advance real-time video applications.
Tianyi Yin (Thu,) studied this question.