What question did this study set out to answer?

The aim is to create a system that generates detailed, animated avatars from a one-minute video, enhancing realism in digital representations.

January 22, 2026

Make-Your-Anchor+: Temporal Consistent 2D Avatar Generation via Video Diffusion Prior

Key Points

Developed a structure-guided diffusion model, fine-tuned on the input video, that renders 3D mesh conditions into human appearances.
Adopted a two-stage training strategy that maps movements to a specific appearance.
Extracted human motion information from a video diffusion prior by adapting the frame-wise diffusion model to pretrained video diffusion weights.
Introduced a batch-overlapped temporal denoising module that bypasses the limit on video length at inference.
Introduced an identity-specific face enhancement module to improve the visual quality of facial regions.
Demonstrated better visual quality than the compared methods in comparative experiments.
Achieved improved temporal coherence in generated videos.
Preserved individual identity in avatar outputs.
Outperformed state-of-the-art (SOTA) diffusion and non-diffusion methods.

Abstract

Despite the remarkable progress of talking-head-based avatar creation solutions, directly generating anchor-style videos with full-body motions remains challenging. In this study, we propose Make-Your-Anchor+, a novel system that requires only a one-minute video clip of an individual for training and then automatically generates anchor-style videos with precise torso and hand movements. Specifically, we fine-tune a proposed structure-guided diffusion model on the input video to render 3D mesh conditions into human appearances. We adopt a two-stage training strategy for the diffusion model, effectively mapping movements to specific appearances to create digital avatars for online streamers, live shopping hosts, and other applications. To produce videos of arbitrary length, we extract human motion information from a video diffusion prior by adapting the frame-wise diffusion model to pretrained video diffusion weights at lower cost, and we propose a simple yet effective batch-overlapped temporal denoising module to bypass the constraints on video length during inference. Finally, a novel identity-specific face enhancement module is introduced to improve the visual quality of facial regions in the output videos. Comparative experiments demonstrate the system's effectiveness and superiority in visual quality, temporal coherence, and identity preservation, outperforming SOTA diffusion and non-diffusion methods.
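The batch-overlapped temporal denoising idea can be illustrated with a small sketch. The abstract does not say how overlapping outputs are combined, so the version below assumes the simplest option: at each denoising step, frames are processed in fixed-size windows that share some frames with their neighbors, and frames covered by more than one window take the average of the window outputs. The function names, the `window` and `overlap` parameters, and the use of one float per frame in place of a latent tensor are all hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Sequence


def overlapped_windows(num_frames: int, window: int, overlap: int) -> List[range]:
    """Split frame indices into windows that share `overlap` frames with the next one."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    if num_frames <= 0:
        return []
    stride = window - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window, num_frames)
        windows.append(range(start, end))
        if end == num_frames:
            break
        start += stride
    return windows


def denoise_step_overlapped(
    latents: Sequence[float],
    window: int,
    overlap: int,
    denoise_window: Callable[[List[float]], List[float]],
) -> List[float]:
    """Run one denoising step window by window and average the overlapping frames.

    `denoise_window` stands in for a video diffusion model that can only
    handle `window` frames at a time (an assumption for this sketch).
    """
    sums = [0.0] * len(latents)
    counts = [0] * len(latents)
    for idx in overlapped_windows(len(latents), window, overlap):
        out = denoise_window([latents[i] for i in idx])
        for i, value in zip(idx, out):
            sums[i] += value
            counts[i] += 1
    # Every frame is covered by at least one window, so counts are never zero.
    return [s / c for s, c in zip(sums, counts)]


def window_mean(xs: List[float]) -> List[float]:
    """Toy per-window model: replace every frame with the window mean."""
    m = sum(xs) / len(xs)
    return [m] * len(xs)


# Frames shared by two windows blend both outputs, softening the jump at the seam.
smoothed = denoise_step_overlapped([0, 0, 0, 0, 4, 4, 4, 4], 4, 2, window_mean)
```

Because each window is bounded in size, memory use per model call stays constant while the total number of frames can grow, which is the property the abstract attributes to the module. Averaging is only one way to merge overlaps; a weighted blend would fit the same structure.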
