This paper addresses the task of generating animations of a digital avatar that synchronously reproduces speech, facial expressions, and gestures from a bimodal input: a static image and an emotionally colored text. The study explores the integration of acoustic, visual, and affective features into a unified model that enables realistic and expressive avatar behavior aligned with both the semantic content and the emotional tone of the utterance. The proposed method includes several stages: extraction of visual landmarks of the face, hands, and body pose; gender recognition for selecting an appropriate voice profile; emotional analysis of the input text; and generation of synthetic speech. All extracted features are integrated within a generative architecture based on a diffusion model enhanced with temporal attention mechanisms and cross-modal alignment strategies. This design is intended to ensure high-precision synchronization between speech and the avatar's nonverbal behavior. Training used two specialized datasets, one focused on gesture modeling and the other on facial expression synthesis. Annotation was performed using automated spatial landmark extraction tools. Experimental evaluation was conducted on a multiprocessor computing platform with GPU acceleration. The model's performance was assessed using a set of objective metrics. The proposed method demonstrated a high degree of visual and semantic coherence, with FID = 50.13, FVD = 601.70, SSIM = 0.752, PSNR = 21.997, E-FID = 2.226, Sync-D = 7.003, and Sync-C = 6.398. The model effectively synchronizes speech with facial expressions and gestures, accounts for the emotional context of the text, and incorporates features of Russian Sign Language. The proposed approach has potential applications in emotionally aware human-computer interaction systems, digital assistants, educational platforms, and psychological interfaces.
The method is of interest to researchers in artificial intelligence, multimodal interfaces, computer graphics, and digital psychology.
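Of the reported metrics, PSNR has the simplest closed form, so a short sketch may help when reading the value of 21.997. The code below uses the standard definition, PSNR = 10 · log10(MAX² / MSE), and is not taken from the paper's evaluation code. The function names `psnr` and `mean_video_psnr`, the 8-bit peak value of 255, the flat-sequence frame representation, and per-frame averaging over a clip are all assumptions made for illustration.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], generated: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two frames given as flat pixel sequences.

    Assumes 8-bit intensities by default (max_value=255); this is an
    illustrative choice, not a setting confirmed by the paper.
    """
    if len(reference) != len(generated) or len(reference) == 0:
        raise ValueError("frames must be non-empty and of equal length")
    # Mean squared error over all pixel values.
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


def mean_video_psnr(reference_frames: Sequence[Sequence[float]],
                    generated_frames: Sequence[Sequence[float]],
                    max_value: float = 255.0) -> float:
    """Average per-frame PSNR over a clip (hypothetical aggregation scheme)."""
    if len(reference_frames) != len(generated_frames) or len(reference_frames) == 0:
        raise ValueError("clips must be non-empty and of equal length")
    scores = [psnr(r, g, max_value)
              for r, g in zip(reference_frames, generated_frames)]
    return sum(scores) / len(scores)
```

Higher values indicate that generated frames are closer to the reference frames at the pixel level. Under the 8-bit assumption, a uniform error of 25.5 intensity levels per pixel corresponds to 20 dB.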
Alexandr Axyonov
Elena Ryumina
National Research University Higher School of Economics
Dmitry Ryumin
National Research University Higher School of Economics
Scientific and Technical Journal of Information Technologies, Mechanics and Optics
Russian Academy of Sciences
State Research Center of the Russian Federation
Axyonov et al. (Fri,) studied this question.
synapsesocial.com/papers/68c182589b7b07f3a060f043 — DOI: https://doi.org/10.17586/2226-1494-2025-25-4-651-662