What question did this study set out to answer?

This research aims to improve text-driven generation of 3D human motion by addressing three limitations of existing methods: imprecise semantic alignment, weak control over local body-part motion, and poor global coordination.

May 15, 2026

Text-driven 3D human motion diffusion generation with local generation and global fusion

Key Points

Proposed a cascaded diffusion generation framework that combines local generation with global fusion.
Used a large language model to decouple the input text into independent semantic descriptions for six body parts (head, limbs, torso), each feeding its own local diffusion encoder.
Developed a global fusion network that combines the local features into biomechanically consistent full-body poses, decoded as SMPL parametric meshes.
Reported improvements over baseline methods in generation quality and motion accuracy on the HumanML3D and KIT-ML datasets, with ablation experiments supporting the design.
Evaluated results with two metrics, FID (where lower is better) and CLIP-S (where higher is better).
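
For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how a Fréchet distance and a cosine-similarity score are commonly computed from feature vectors. It assumes features have already been extracted by some encoder; the function names `frechet_distance` and `cosine_similarity` are illustrative, and the paper's exact CLIP-S definition and feature extractors may differ.

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two feature sets.

    Each input is an array of shape (num_samples, feature_dim), feature_dim >= 2.
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative for
    # covariance matrices; clip guards against tiny negative round-off.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (e.g. text and motion)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

A lower Fréchet distance means the generated feature distribution sits closer to the real one, while a higher cosine similarity means the generated sample matches the text embedding more closely.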

Abstract

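Objective: Generating 3D human motion from text prompts is a frontier research direction in multimodal generation. Despite considerable progress, existing methods remain limited in semantic alignment accuracy, local motion control, and global coordination, which makes integrated generation from text to high-fidelity 3D assets difficult. To address these problems, this paper proposes a cascaded diffusion generation framework that combines local generation with global fusion.

Method: First, a large language model automatically decouples the input text into independent semantic descriptions for six body parts, including the head, limbs, and torso. Second, six parallel, gradient-isolated local diffusion encoders generate motion features for each part independently. Third, a global fusion network fuses the local features into a biomechanically consistent full-body pose and decodes it into an SMPL (skinned multi-person linear model) parametric mesh. Finally, the SMPL mesh is converted into a 3D Gaussian representation, and a 2D diffusion model is introduced as a visual prior to optimize appearance details through score distillation sampling, achieving integrated generation from text to a 3D human that can be rendered in real time.

Results: Comparative experiments were conducted on the HumanML3D (3D human motion-language dataset) and KIT-ML (the KIT motion-language dataset) datasets, and the outputs of the proposed method and the baseline methods were evaluated along two dimensions: FID (Fréchet inception distance) and CLIP-S (CLIP similarity). Compared with the baselines, the proposed method improves both generation quality and motion accuracy, and ablation experiments validate the effectiveness of the design.

Conclusion: The proposed method effectively improves the detail expressiveness, diversity, and text-semantic consistency of the generated human motion, providing an efficient and scalable technical solution for 3D human motion generation.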
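
The first three stages described in the abstract (text decomposition, six independent local branches, global fusion) can be sketched as a data flow. This is only a structural illustration under loud assumptions: the part names, function names, and feature sizes are hypothetical, the language model is replaced by a stub, and random arrays stand in for the outputs of the diffusion encoders. No diffusion, SMPL decoding, or Gaussian rendering is performed.

```python
import numpy as np

# Hypothetical part list: the abstract says six parts covering head, limbs, and torso,
# but does not name them individually.
BODY_PARTS = ["head", "torso", "left_arm", "right_arm", "left_leg", "right_leg"]

def decompose_prompt(prompt):
    """Stand-in for the LLM step: one description per body part.

    A real system would query a language model; this stub just tags the prompt.
    """
    return {part: f"{part}: {prompt}" for part in BODY_PARTS}

def local_encoder(description, frames, dim, seed):
    """Stand-in for one local diffusion encoder.

    Returns a (frames, dim) array of placeholder features; the description is
    accepted only to show what each branch would condition on.
    """
    rng = np.random.default_rng(seed)
    return rng.normal(size=(frames, dim))

def global_fusion(part_features):
    """Stand-in for the fusion network: concatenate part features per frame."""
    return np.concatenate([part_features[p] for p in BODY_PARTS], axis=-1)

def generate_motion_features(prompt, frames=8, dim=4):
    descriptions = decompose_prompt(prompt)
    # Each branch receives only its own description and shares no state with the
    # others, mirroring the "parallel, isolated" structure at the data-flow level.
    part_features = {
        part: local_encoder(descriptions[part], frames, dim, seed=i)
        for i, part in enumerate(BODY_PARTS)
    }
    return global_fusion(part_features)
```

In a trainable version, the isolation between branches would be enforced in the autograd framework (for example by keeping each branch's parameters and gradients separate), which plain NumPy cannot express.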

Cite This Study

Renjie et al. studied this question.

synapsesocial.com/papers/6a06b7a1e7dec685947aa6e4 https://doi.org/10.11834/jig.250606
