ABSTRACT Text‐driven 3D human motion editing aims to modify an existing motion sequence according to natural language instructions, a crucial task for character animation, virtual agents, and motion authoring. Recent diffusion‐based methods have shown remarkable success in text‐to‐motion generation, but editing an existing motion requires precise spatiotemporal control to localize modifications while preserving the surrounding context. Current diffusion‐based motion editing methods lack explicit fine‐grained control over when and how strongly to edit. To address this, we propose TM‐Edit, a text‐guided motion editing framework based on a Diffusion Transformer that introduces learned temporal soft masks to provide explicit frame‐wise editing guidance. The model predicts an editing intensity mask from both the source motion and the text instruction, encoding high‐level editing intent. This mask then modulates the source motion features within a conditional diffusion process via an uncertainty‐aware gating mechanism, ensuring robust training and inference. Additionally, a feature semantic alignment loss, computed with a pre‐trained motion retrieval model, is employed to enhance cross‐modal consistency. Extensive experiments on the MotionFix benchmark dataset demonstrate that our approach achieves state‐of‐the‐art performance. Code will be made publicly available.
Zheng et al. (Fri,) studied this question.