Generating high-quality character animation videos is a fascinating yet challenging task. Existing methods feed geometry guidance signals such as skeletons, normal maps, or depth maps into a diffusion model to generate character videos from a single reference image. Although these approaches have shown encouraging results, they rely solely on cross-attention layers to extract geometry guidance, which inevitably leads to temporal inconsistencies and reduced quality. In this paper, we present AniFeats, a novel framework for generating high-quality character animation videos. In contrast to existing methods, our key insight is to incorporate explicit features on 3D character meshes during video generation to achieve significantly improved temporal consistency. Specifically, AniFeats extracts detailed features from the reference image, projects them onto 3D feature meshes based on SMPL-X, and uses feature maps rendered from the animated 3D feature meshes as guidance throughout the generation process. This approach directly links local patterns in the input image to those in the output video, effectively strengthening temporal coherence. Extensive experiments demonstrate that AniFeats generates high-quality, temporally consistent character animations with remarkably enhanced realism.
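To make the feature-mesh idea concrete, the following is a minimal NumPy sketch of the two geometric steps described above: attaching image features to mesh vertices, then rendering those features from the animated mesh. It is an illustration under strong simplifying assumptions, not the AniFeats implementation. The function names (`project`, `sample_vertex_features`, `render_feature_map`), the pinhole camera, nearest-pixel sampling, and point splatting are all hypothetical choices. A real system would obtain vertices from an SMPL-X fit, handle occlusion when sampling, and rasterize triangles rather than splat points.

```python
import numpy as np


def project(vertices, focal, center):
    """Pinhole projection of (N, 3) camera-space vertices to (N, 2) pixel coords."""
    x = focal * vertices[:, 0] / vertices[:, 2] + center[0]
    y = focal * vertices[:, 1] / vertices[:, 2] + center[1]
    return np.stack([x, y], axis=1)


def sample_vertex_features(feature_map, vertices, focal, center):
    """Give each vertex the feature at its nearest projected pixel.

    feature_map has shape (H, W, C). Visibility is ignored in this sketch,
    so occluded vertices also receive the feature of the pixel they hit.
    """
    h, w, _ = feature_map.shape
    uv = np.rint(project(vertices, focal, center)).astype(int)
    u = np.clip(uv[:, 0], 0, w - 1)
    v = np.clip(uv[:, 1], 0, h - 1)
    return feature_map[v, u]  # shape (N, C)


def render_feature_map(vertex_features, vertices, focal, center, size):
    """Splat per-vertex features into an (H, W, C) map with a depth test."""
    h, w = size
    out = np.zeros((h, w, vertex_features.shape[1]), dtype=vertex_features.dtype)
    depth = np.full((h, w), np.inf)
    uv = np.rint(project(vertices, focal, center)).astype(int)
    for (u, v), z, feat in zip(uv, vertices[:, 2], vertex_features):
        # Keep only the vertex closest to the camera at each pixel.
        if 0 <= u < w and 0 <= v < h and z < depth[v, u]:
            depth[v, u] = z
            out[v, u] = feat
    return out
```

In this sketch, features are sampled once from the reference image with `sample_vertex_features` and then stay attached to their vertices. Calling `render_feature_map` with the vertex positions of each animated frame yields a guidance map in which a given surface point carries the same feature in every frame, which is the intuition behind linking local patterns in the input image to those in the output video.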
Lu et al. studied this question.