June 1, 2023

Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation

Lingting Zhu, University of Hong Kong; Xian Liu, Nvidia (United Kingdom); Xuanyu Liu, Central South University

Key Points

This work aims to create a framework for generating co-speech gestures in virtual avatars using audio input.
Developed the Diffusion Co-Speech Gesture (DiffGesture) framework to capture cross-modal audio-gesture associations.
Implemented a Diffusion Audio-Gesture Transformer for improved attention across modalities and long-term temporal dependency modeling.
Introduced a Diffusion Gesture Stabilizer with an annealed noise sampling strategy to eliminate temporal inconsistency.
DiffGesture achieved state-of-the-art performance in generating coherent gestures with improved mode coverage.
Showed stronger audio correlations in generated gestures across different audio inputs.

Abstract

Animating virtual avatars to make co-speech gestures facilitates various applications in human-machine interaction. The existing methods mainly rely on generative adversarial networks (GANs), which typically suffer from notorious mode collapse and unstable training, thus making it difficult to learn accurate audio-gesture joint distributions. In this work, we propose a novel diffusion-based framework, named Diffusion Co-Speech Gesture (DiffGesture), to effectively capture the cross-modal audio-to-gesture associations and preserve temporal coherence for high-fidelity audio-driven co-speech gesture generation. Specifically, we first establish the diffusion-conditional generation process on clips of skeleton sequences and audio to enable the whole framework. Then, a novel Diffusion Audio-Gesture Transformer is devised to better attend to the information from multiple modalities and model the long-term temporal dependency. Moreover, to eliminate temporal inconsistency, we propose an effective Diffusion Gesture Stabilizer with an annealed noise sampling strategy. Benefiting from the architectural advantages of diffusion models, we further incorporate implicit classifier-free guidance to trade off between diversity and gesture quality. Extensive experiments demonstrate that DiffGesture achieves state-of-the-art performance, which renders coherent gestures with better mode coverage and stronger audio correlations. Code is available at https://github.com/Advocate99/DiffGesture.
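The diversity-versus-quality trade-off mentioned in the abstract can be illustrated with the standard classifier-free guidance blend used in diffusion models. This is a minimal sketch, not the DiffGesture implementation: the function name `guided_noise`, the toy noise values, and the `scale` parameter are all hypothetical, and the paper's implicit variant may differ in detail.

```python
from typing import List


def guided_noise(eps_cond: List[float], eps_uncond: List[float], scale: float) -> List[float]:
    """Blend conditional and unconditional noise predictions.

    Standard classifier-free guidance: start from the unconditional
    prediction and move toward (or past) the audio-conditioned one.
    """
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Hypothetical noise predictions for three skeleton coordinates of one frame.
eps_audio = [0.5, -0.25, 1.0]  # denoiser output with the audio condition
eps_null = [0.25, 0.0, 0.5]    # denoiser output with the condition dropped

unguided = guided_noise(eps_audio, eps_null, scale=0.0)  # ignores the audio
plain = guided_noise(eps_audio, eps_null, scale=1.0)     # plain conditional prediction
strong = guided_noise(eps_audio, eps_null, scale=2.0)    # extrapolates past the conditional prediction
```

In this formulation a larger `scale` pushes samples toward the audio condition, which tends to raise gesture quality at the cost of diversity, while a smaller `scale` does the opposite.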
