What question did this study set out to answer?

This research aims to develop a system for generating 4D human avatars using textual descriptions.

February 26, 2026

CLIP-Actor-X: Text-driven 4D Human Avatar Generation via Cross-modal Synthesis-through-Optimization

Key Points

Developed a text-driven human motion synthesis module based on a generative model.
Implemented a zero-shot neural style optimization for texturizing a neutral human mesh.
Utilized spatio-temporal view augmentation to enhance the optimization process.
Used visibility-aware embedding attention to handle poorly rendered views during optimization.
Generated human avatars with detailed geometry and texture directly from text prompts.
Produced motion animations that are temporally consistent and pose-agnostic.
Achieved output representations that are animatable without additional post-processing.

Abstract

We propose CLIP-Actor-X, a text-driven motion generation and neural mesh stylization system for 4D human avatar generation. CLIP-Actor-X generates a detailed 3D human mesh, motion animation, and texture that conform to a text prompt supplied by a user. The CLIP-Actor-X system consists mainly of two modules. First, to generate realistic human motion, we build a text-driven human motion synthesis module modeled by a retrieval-augmented generative model, powered by a text-to-motion diffusion model. Second, our novel zero-shot neural style optimization module adds detail and texture to the sampled sequence of a neutral human mesh template, such that the resulting mesh and appearance comply with the input text prompt in a temporally consistent and pose-agnostic manner. In contrast to prior art that takes an artist-designed, non-animatable mesh as input, our output representation is animatable, and the generated avatar is better aligned with the input text without additional post-processing, e.g., re-alignment, retargeting, or rigging. We further propose two ways to stabilize the optimization process: spatio-temporal view augmentation and visibility-aware embedding attention, which deals with poorly rendered views. We demonstrate that CLIP-Actor-X produces perceptually plausible and human-recognizable human avatars in motion, with detailed geometry and texture, solely from a natural language prompt.
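The abstract does not say how visibility-aware embedding attention is computed, so the following is only a minimal sketch of one plausible reading: each rendered view contributes to a text-alignment loss in proportion to a visibility score, so that poorly rendered views have little influence. Everything here is an assumption for illustration. The function names, the per-view `visibility` score, the softmax weighting, and the `temperature` parameter are hypothetical, and plain Python lists stand in for the image and text embeddings a CLIP-style encoder would produce.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def visibility_weights(visibility, temperature=0.1):
    """Softmax over per-view visibility scores (hypothetical weighting).

    A lower temperature concentrates weight on the most visible views.
    """
    peak = max(visibility)  # subtract the max for numerical stability
    exps = [math.exp((v - peak) / temperature) for v in visibility]
    total = sum(exps)
    return [e / total for e in exps]


def visibility_weighted_loss(view_embeddings, visibility, text_embedding,
                             temperature=0.1):
    """One minus the visibility-weighted mean of text-to-view similarities.

    Assumption: views with low visibility scores are poorly rendered and
    should contribute less to the optimization signal.
    """
    weights = visibility_weights(visibility, temperature)
    sims = [cosine(v, text_embedding) for v in view_embeddings]
    return 1.0 - sum(w * s for w, s in zip(weights, sims))


if __name__ == "__main__":
    text = [1.0, 0.0]                    # stand-in for a text embedding
    views = [[1.0, 0.0], [0.0, 1.0]]     # stand-ins for two rendered views
    vis = [0.9, 0.1]                     # second view is mostly occluded
    print(visibility_weights(vis))
    print(visibility_weighted_loss(views, vis, text))
```

In this toy setup the occluded second view is almost entirely ignored, so the loss stays close to zero even though that view's embedding does not match the text. In a real system the loss would be backpropagated through a differentiable renderer to the mesh and texture parameters, which this sketch does not attempt.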

Cite This Study

Youwang et al. studied this question.

synapsesocial.com/papers/699fe28895ddcd3a253e63b3 https://doi.org/10.1109/tpami.2026.3665111
