Reconstructing accurate full-body poses from sparse tracking signals is crucial in generative multimedia, yet effective sequence modeling and high-quality generation remain challenging. Recently, state space models (SSMs), especially Mamba, have shown considerable promise in sequence modeling, suggesting an appealing direction for building motion generation models. However, adapting SSMs to this task is challenging for two main reasons. First, the sparsity of the input conditions makes it difficult to estimate full-body motion accurately. Second, SSMs lack a specialized design for capturing the temporal dependencies within motion sequences. To address these challenges, we propose a novel conditional diffusion model, Pose Mamba Diffusion (PMDiff), which comprises a bidirectional Pose Mamba Denoiser for effective sequence modeling. PMDiff combines the robust generation capabilities of diffusion models with the sequence modeling proficiency of Mamba to track full-body poses accurately. Additionally, we observe that synthesized human motions often lack smoothness, resulting in unnatural motion sequences. To overcome this issue, we further introduce a learning-free Temporal Motion Filter (TMF), which markedly improves the smoothness of the generated human motion. Extensive experiments on the large motion capture database AMASS show that PMDiff outperforms previous state-of-the-art methods in both accuracy and smoothness.
Xue et al. (Sun,) studied this question.
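The abstract does not specify how the Temporal Motion Filter works, only that it is learning-free and improves smoothness. As a rough illustration of what a learning-free temporal smoother could look like, the sketch below applies a fixed Gaussian kernel along the time axis of a motion sequence and measures jitter as the mean magnitude of the third finite difference. This is an assumption made for illustration, not the authors' TMF: the function names (`smooth_motion`, `jitter`), the `(T, D)` array layout, the kernel choice, and the default `radius`, `sigma`, and `fps` values are all hypothetical.

```python
import numpy as np


def gaussian_kernel(radius: int, sigma: float) -> np.ndarray:
    """Normalized 1-D Gaussian weights over offsets -radius..radius."""
    offsets = np.arange(-radius, radius + 1)
    weights = np.exp(-0.5 * (offsets / sigma) ** 2)
    return weights / weights.sum()


def smooth_motion(motion: np.ndarray, radius: int = 2, sigma: float = 1.0) -> np.ndarray:
    """Smooth a (T, D) motion sequence along time with a fixed Gaussian kernel.

    Hypothetical stand-in for a learning-free temporal filter: it has no
    trainable parameters and is applied after generation.
    """
    kernel = gaussian_kernel(radius, sigma)
    # Repeat the first and last frames so the output keeps all T frames.
    padded = np.pad(motion, ((radius, radius), (0, 0)), mode="edge")
    num_frames = motion.shape[0]
    smoothed = np.zeros_like(motion, dtype=float)
    for k, weight in enumerate(kernel):
        smoothed += weight * padded[k:k + num_frames]
    return smoothed


def jitter(motion: np.ndarray, fps: float = 60.0) -> float:
    """Mean magnitude of the third finite difference (jerk) over time.

    Requires at least four frames.
    """
    jerk = np.diff(motion, n=3, axis=0) * fps ** 3
    return float(np.linalg.norm(jerk, axis=1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.linspace(0.0, 2.0 * np.pi, 120)
    clean = np.stack([np.sin(t), np.cos(t), 0.5 * t], axis=1)  # toy (T, 3) trajectory
    noisy = clean + 0.05 * rng.standard_normal(clean.shape)
    filtered = smooth_motion(noisy)
    print("jitter reduced:", jitter(filtered) < jitter(noisy))
```

A fixed kernel like this trades smoothness against positional accuracy, and plain averaging is only approximate when the features are rotations rather than positions, so any real filter would need to account for the pose representation in use.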