ABSTRACT Co‐speech gesture generation plays a key role in virtual reality interaction, synthesizing appropriate digital human motion driven by speech. Generative methods create realistic gestures that follow the rhythm and semantic content of speech, improving the interactive experience. To match the persona of a digital human, generated gestures often require additional modification before being applied to a virtual character. However, motion sequences are difficult to edit when they are generated from hidden motion representations. To make motion synthesis editable, we develop a two‐stage controllable pipeline for the co‐speech gesture generation problem. In stage 1, we design a novel large language model based IK‐Decoder that takes speech and a style label as input and synthesizes inverse kinematic style control points, which are highly editable. In stage 2, we divide the motion sequence into body and finger parts and learn a VQ‐based latent motion representation for each. A diffusion‐based IK‐Denoiser is then proposed to synthesize the latent motion representation conditioned on the control points. Compared with other representative algorithms, the proposed method achieves competitive performance on metrics such as Fréchet Gesture Distance, Beat Consistency, and Diversity. To demonstrate controllability, it provides three explicit control strategies for motion editing. With these control points, we provide a new co‐speech gesture generation paradigm.
Peng et al. (Sat,) studied this question.