May 21, 2026 | Open Access

MAVAGEN: Multimodal Avatar Generation Framework for Personalized Human–Computer Interaction

Key Points

The study aims to create a framework for generating personalized avatars that express emotions effectively during human–computer interaction.
Developed MAVAGEN, a multimodal avatar generation framework.
Combined an identity image, user-specified gender and age, and six feature streams: textual, emotion-distribution, landmark-based pose, depth-geometry, RGB-appearance, and acoustic.
Evaluated MAVAGEN against recent human animation methods using quantitative metrics including FID, FVD, SSIM, Sync-C, HKC, HKV, CSIM, and EmoAcc.
Achieved the best overall avatar quality among the compared methods, with FID 48.20 and FVD 592.00.
In ablations, emotion and acoustic features contributed most to emotional agreement.
In ablations, landmark-based pose and depth features improved geometric and motion stability.

Abstract

Digital-avatar systems still provide limited control over emotionally expressive behavior in human–computer interaction, especially in Large Language Model (LLM)-based chatbots and virtual assistants with personalized visual embodiments. To address this problem, we propose Multimodal Avatar Generation (MAVAGEN), a framework for synthesizing upper-body digital avatars with personalized appearance and controllable emotional expression. The user specifies the desired gender and age and provides a short text input from which the target emotional state is inferred. MAVAGEN then retrieves an identity image from the HaGRIDv2-1M corpus and generates an avatar clip with synchronized facial expressions, hand gestures, and expressive speech. The framework uses six feature streams: textual features, emotion-distribution features, landmark-based pose features, depth-geometry features, RGB-appearance features, and acoustic features. In a quantitative evaluation against recent human animation methods, MAVAGEN achieves the best overall avatar quality, with FID 48.20, FVD 592.00, SSIM 0.741, Sync-C 7.40, HKC 0.929, HKV 25.30, CSIM 0.563, and EmoAcc 0.88. Ablation results show that emotion and acoustic features contribute most to emotional agreement, while landmark-based pose and depth features improve geometric and motion stability. These results support the practical use of MAVAGEN in personalized LLM-based assistants and other emotion-sensitive interactive systems.
