Natural social interactions involve two agents exhibiting smooth and diverse behaviors that align with each other's intent in real time. Creating this level of expressiveness in human–robot interaction (HRI) requires a robot to go beyond simple reactive behaviors and anticipate the rich distribution of possible human actions, enabling responses that are diverse, human-like, and socially aligned. This thesis bridges the gap between complex generative modeling and actual robotic deployment by integrating visual perception, context-aware motion generation, and execution on physical hardware into a single coherent system. At the core of the system lies a latent diffusion framework designed for the joint generation of two-person social interactions. Given past context and a high-level interaction description, the model generates potential future motions for both agents in an interdependent manner. By operating within a temporally coherent latent space, the framework produces smooth, aligned motion segments while significantly reducing computational overhead to support live interaction. To run in real time, the model is integrated into a continuous streaming pipeline that combines chunked diffusion inference with online SMPL-X pose estimation from a single RGB-D camera, eliminating the need for restrictive motion capture systems and enabling continuous prediction from live human input. The framework is demonstrated both in simulation and in real-world experiments with Tiago++ and Unitree G1 robots, with the generated reactor motion retargeted online to each platform's embodiment. Overall, this thesis presents an approach to diverse and responsive motion generation, advancing the development of socially aware robots capable of engaging with humans naturally and adaptively under realistic conditions.
Sergej Stanovcic (Fri,) studied this question.