What question did this study set out to answer?

This work aims to address the challenges in generating realistic two-person interactive motions by capturing the temporal dependencies between participants.

May 4, 2026

HiTMM: Generative Temporal Masked Modeling of Human Interactive Motions.

Key Points

Targets realistic two-person interactive motion generation by capturing the temporal dependencies between participants.
Proposes HiTMM, a framework that decomposes each interaction into two separate single-person motions.
Maps both motions into a shared latent space through a coarse-to-fine approach that produces multi-layer discrete tokens.
Employs a masked transformer and a residual transformer to model base-layer and rest-layer motion tokens along a shared timeline.
Achieves an FID of 5.017 on the InterHuman dataset, improving on the previous state of the art (5.154 for InterMask; lower is better).
Achieves an FID of 0.373 on the InterX dataset, improving on InterMask (0.399).
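
To put the two FID gaps in proportion, the short sketch below computes the relative reduction from the figures quoted above. The helper name `relative_fid_reduction` is introduced here purely for illustration; only the four FID values come from the source.

```python
def relative_fid_reduction(baseline: float, ours: float) -> float:
    """Return the percentage by which `ours` lowers FID relative to `baseline`."""
    return 100.0 * (baseline - ours) / baseline


# FID values as reported: InterMask (baseline) vs. HiTMM.
interhuman = relative_fid_reduction(baseline=5.154, ours=5.017)
interx = relative_fid_reduction(baseline=0.399, ours=0.373)

print(f"InterHuman: {interhuman:.2f}% lower FID")  # about 2.66%
print(f"InterX:     {interx:.2f}% lower FID")      # about 6.52%
```

The relative gap is larger on InterX than on InterHuman, although both absolute differences are small.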

Abstract

Human-human interaction generation has seen recent progress. However, directly generating complex two-person interactive motions remains a significant challenge. Moreover, existing models typically employ two independent timelines when generating motions for interactive scenarios involving two individuals. This design overlooks the temporal dependencies between motions at each timestep and fails to account for the roles of active and reactive participants during the generation process, often resulting in unrealistic and unnatural motions. In this work, we propose HiTMM, a novel framework for Human interaction generation based on Temporal Masked Modeling. HiTMM first decomposes the human interaction into two separate single-person motions. Individual motions within the interaction belong to the same type, enabling them to be mapped to a shared latent space through a coarse-to-fine approach that produces multi-layer discrete tokens. We then arrange all tokens of the two interacting individuals along a shared timeline. Subsequently, we employ a masked transformer and a residual transformer to model the base-layer and rest-layer motion tokens. Both the base-layer and rest-layer motion tokens are arranged along a single timeline, allowing the model to explicitly capture the temporal order and initiating role embedded in the sequence, where the first individual's motion initiates the interaction. Because our model uses a shared temporal representation, it can also perform temporal editing on specific regions within human interaction sequences. Experimental results show that our model achieves an FID of 5.017 on the InterHuman dataset, surpassing the current state-of-the-art model (5.154 for InterMask), and an FID of 0.373 on the InterX dataset (0.399 for InterMask). Project URL: https://jiaozicheng.github.io/HiTMM/.
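
The shared-timeline idea and the temporal editing it enables can be illustrated with a minimal sketch. The abstract does not specify the exact token layout, so the per-timestep interleaving used here (initiator first, then the reactive person) is an assumption, as are the names `to_shared_timeline`, `split_timeline`, `mask_time_window`, and the `MASK` id. No transformer is involved; the sketch only shows how tokens might be arranged and which positions a masked model would be asked to regenerate.

```python
MASK = -1  # hypothetical id standing in for a [MASK] token


def to_shared_timeline(tokens_a, tokens_b):
    """Interleave two token sequences so that, at every timestep,
    person A (assumed initiator) comes before person B (assumed reactor)."""
    if len(tokens_a) != len(tokens_b):
        raise ValueError("both persons need the same number of timesteps")
    timeline = []
    for a, b in zip(tokens_a, tokens_b):
        timeline.extend([a, b])
    return timeline


def split_timeline(timeline):
    """Undo the interleaving: even positions are person A, odd are person B."""
    return timeline[0::2], timeline[1::2]


def mask_time_window(timeline, start, end):
    """Mask both persons' tokens for timesteps in [start, end),
    marking the region a masked model would regenerate during editing."""
    masked = list(timeline)
    for t in range(start, end):
        masked[2 * t] = MASK
        masked[2 * t + 1] = MASK
    return masked


# Toy base-layer tokens for four timesteps per person.
person_a = [11, 12, 13, 14]
person_b = [21, 22, 23, 24]

timeline = to_shared_timeline(person_a, person_b)
edited = mask_time_window(timeline, start=1, end=3)

print(timeline)  # [11, 21, 12, 22, 13, 23, 14, 24]
print(edited)    # [11, 21, -1, -1, -1, -1, 14, 24]
```

Because both persons share one index space, a single time window selects the matching tokens of both participants, which is what makes region-level editing straightforward in this layout.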
