What question did this study set out to answer?

The aim is to create a system for generating diverse and socially aligned motions in human-robot interactions using generative modeling techniques.

May 31, 2026 | Open Access

Real-World Human-Robot Interaction Behavior Generation using Latent Diffusion Models

Key Points

The aim is to create a system for generating diverse and socially aligned motions in human-robot interactions using generative modeling techniques.
Integrates visual perception, context-aware motion generation, and hardware execution into a single coherent system.
Develops a latent diffusion framework that jointly generates two-person social interactions from past context and a high-level interaction description.
Runs in real time on the Tiago++ and Unitree G1 robots through a continuous streaming pipeline.
Generates diverse, aligned motion segments in real time, with the latent-space formulation reducing computational overhead.
Demonstrates integration with robotic platforms in both simulation and real-world experiments.

Abstract

Natural social interactions involve two agents exhibiting smooth and diverse behaviors that align with each other's intent in real time. Creating this level of expressiveness in human–robot interaction (HRI) requires a robot to go beyond simple reactive behaviors and instead anticipate the rich distribution of possible human actions, enabling responses that are diverse, human-like, and socially aligned. This thesis bridges the gap between complex generative modeling and actual robotic deployment by integrating visual perception, context-aware motion generation, and physical-hardware execution into a single coherent system. At the core of the system lies a latent diffusion framework designed for the joint generation of two-person social interactions. Given past context and a high-level interaction description, our model generates potential future motions for both agents in an interdependent manner. By operating within a temporally coherent latent space, the framework ensures smooth, aligned motion segments while significantly reducing computational overhead to support live interaction. To achieve real-time generation, the model is integrated into a continuous streaming pipeline that combines chunked diffusion inference with real-time SMPL-X pose estimation from a single RGBD camera, eliminating the need for restrictive motion capture systems and enabling continuous prediction from live human input. The framework is demonstrated both in simulation and through real-world experiments with Tiago++ and Unitree G1 robots, with generated reactor motion retargeted online to each platform's embodiment. Ultimately, this thesis provides a robust solution for diverse and responsive motion generation, advancing the development of socially aware robots capable of engaging with humans naturally and adaptively under realistic conditions.
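
The streaming idea described in the abstract (keep a sliding window of past context, and periodically denoise a fresh chunk of future latent frames from it) can be illustrated with a minimal Python sketch. This is not the thesis implementation: every name and constant here (`LATENT_DIM`, `CONTEXT_LEN`, `CHUNK_LEN`, `NUM_STEPS`, `toy_denoise_step`, `generate_chunk`, `stream`) is hypothetical, the denoiser is a toy update that pulls noise toward the context mean instead of a learned network, and the text conditioning, SMPL-X pose estimation, and robot retargeting are omitted.

```python
from collections import deque

import numpy as np

# All constants below are hypothetical values chosen for illustration.
LATENT_DIM = 8    # size of one latent frame
CONTEXT_LEN = 4   # past latent frames kept as conditioning context
CHUNK_LEN = 2     # future latent frames generated per chunk
NUM_STEPS = 10    # denoising iterations per chunk


def toy_denoise_step(x, context):
    """Stand-in for a learned denoiser (assumption, not a real sampler).

    Moves every frame of x halfway toward the mean of the context frames.
    """
    target = context.mean(axis=0, keepdims=True)  # shape (1, LATENT_DIM)
    return x + 0.5 * (target - x)


def generate_chunk(context, rng):
    """Start from Gaussian noise and refine it for NUM_STEPS iterations."""
    x = rng.standard_normal((CHUNK_LEN, LATENT_DIM))
    for _ in range(NUM_STEPS):
        x = toy_denoise_step(x, context)
    return x


def stream(observations, seed=0):
    """Consume latent observations one by one and emit future chunks.

    A new chunk is generated once the context window is full, and then
    again after every CHUNK_LEN newly observed frames.
    """
    rng = np.random.default_rng(seed)
    context = deque(maxlen=CONTEXT_LEN)  # sliding window of past frames
    chunks = []
    since_last = CHUNK_LEN  # so the first full window triggers generation
    for obs in observations:
        context.append(obs)
        since_last += 1
        if len(context) == CONTEXT_LEN and since_last >= CHUNK_LEN:
            chunks.append(generate_chunk(np.stack(context), rng))
            since_last = 0
    return chunks


if __name__ == "__main__":
    fake_observations = np.zeros((10, LATENT_DIM))  # placeholder input
    out = stream(fake_observations)
    print(len(out), out[0].shape)
```

In a real system the observations would be encoded poses of both agents and each generated chunk would be decoded back to motion before retargeting; the sketch only shows the control flow of chunked generation over a live stream.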
