September 10, 2025 · Open Access

A method for generating digital avatar animation with speech and non-verbal synchronization based on bimodal data

Key Points

The proposed method effectively synchronizes speech with avatar facial expressions and gestures, accounting for emotional context.
Using a specialized dataset focused on gesture modeling, the method achieved visual and semantic coherence, with an FID of 50.13 among the reported metrics.
The approach integrates acoustic, visual, and affective features into a unified model for realistic avatar behavior.
Innovative use of cross-modal alignment strategies ensures high-precision synchronization between speech and non-verbal actions.

Abstract

This paper addresses the task of generating animations of a digital avatar that synchronously reproduces speech, facial expressions, and gestures from a bimodal input: a static image and an emotionally colored text. The study explores the integration of acoustic, visual, and affective features into a unified model that enables realistic and expressive avatar behavior aligned with both the semantic content and the emotional tone of the utterance. The proposed method includes several stages: extraction of visual landmarks of the face, hands, and body pose; gender recognition for selecting an appropriate voice profile; emotional analysis of the input text; and generation of synthetic speech. All extracted features are integrated within a generative architecture based on a diffusion model enhanced with temporal attention mechanisms and cross-modal alignment strategies. This ensures high-precision synchronization between speech and the avatar's nonverbal behavior. The training process used two specialized datasets: one focused on gesture modeling, the other on facial expression synthesis. Annotation was performed using automated spatial landmark extraction tools. Experimental evaluation was conducted on a multiprocessor computing platform with GPU acceleration. The model's performance was assessed using a set of objective metrics. The proposed method demonstrated a high degree of visual and semantic coherence: FID = 50.13, FVD = 601.70, SSIM = 0.752, PSNR = 21.997, E-FID = 2.226, Sync-D = 7.003, Sync-C = 6.398. The model effectively synchronizes speech with facial expressions and gestures, accounts for the emotional context of the text, and incorporates features of Russian Sign Language. The proposed approach has potential applications in emotionally aware human-computer interaction systems, digital assistants, educational platforms, and psychological interfaces.
The method is of interest to researchers in artificial intelligence, multimodal interfaces, computer graphics, and digital psychology.
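For readers unfamiliar with the reported metrics, PSNR is the simplest to illustrate. The sketch below computes it from the mean squared error between a reference frame and a generated frame, both given as flat sequences of pixel intensities. It is an illustration of the standard definition only, not the authors' evaluation code: the function name `psnr`, the flat-sequence input, and the default peak value of 255 (8-bit pixels) are assumptions, since the abstract does not state how frames were represented or compared.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], generated: Sequence[float],
         peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in decibels between two frames.

    Both frames are flat sequences of pixel intensities of equal length.
    `peak` is the maximum possible intensity (255 assumes 8-bit pixels;
    this is an assumption, not a detail taken from the paper).
    """
    if len(reference) != len(generated) or len(reference) == 0:
        raise ValueError("frames must be non-empty and of equal length")

    # Mean squared error over all pixels.
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)

    # Identical frames have zero error, so PSNR is unbounded.
    if mse == 0:
        return math.inf

    return 10.0 * math.log10(peak ** 2 / mse)


if __name__ == "__main__":
    ref = [100.0, 100.0, 100.0, 100.0]
    gen = [110.0, 110.0, 110.0, 110.0]
    print(f"PSNR: {psnr(ref, gen):.3f} dB")
```

Higher values indicate that the generated frame is closer to the reference. In a video setting the per-frame values would typically be averaged over the sequence, although the abstract does not say which aggregation was used.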

Authors

Alexandr Axyonov

Elena Ryumina

National Research University Higher School of Economics

Dmitry Ryumin

National Research University Higher School of Economics

Journals

Scientific and Technical Journal of Information Technologies, Mechanics and Optics

Institutions

Russian Academy of Sciences

State Research Center of the Russian Federation
