This study presents an interactive, AI-driven framework that generates piano music in real time from human body motion, linking physical gesture to computational creativity. The system combines computer vision–based motion capture with sequence-oriented deep learning to translate continuous movement dynamics into structured musical output. Human pose is extracted with MediaPipe, and OpenCV is used for temporal motion tracking; together they yield three-dimensional skeletal landmarks and velocity-based features that modulate musical expression. These motion-derived signals condition a Long Short-Term Memory (LSTM) network trained on a large corpus of classical piano MIDI compositions. The model thereby preserves stylistic coherence and long-range musical dependencies while adapting tempo and rhythmic intensity to the performer's movement in real time. The data processing pipeline comprises MIDI event encoding, sequence segmentation, and feature normalization, followed by training of a multi-layer LSTM with cross-entropy loss and the RMSprop optimizer. Model performance is evaluated quantitatively through loss convergence and note diversity metrics, and qualitatively through assessments of musical coherence and system responsiveness. Experimental results show that the LSTM-based generator remains structurally stable while producing diverse, expressive musical sequences that closely track variations in motion velocity. By establishing a closed-loop, real-time mapping between gesture and sound, the framework enables intuitive, embodied musical interaction without requiring traditional instrumental expertise. It thus contributes to embodied AI and multimodal human–computer interaction, and opens opportunities for digital performance, creative education, and accessible music generation through movement.
Bukaita et al. (Mon,) studied this question.
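
The abstract names two steps that a short sketch can make concrete: deriving a velocity feature from successive skeletal landmarks, and segmenting an encoded note stream into training sequences for the LSTM. The Python below is only an illustration under stated assumptions, not the authors' implementation. The function names (`mean_landmark_speed`, `speed_to_tempo`, `window_sequences`), the linear speed-to-tempo mapping, and the constants `min_bpm`, `max_bpm`, and `max_speed` are all hypothetical. Landmarks are represented as plain `(x, y, z)` tuples rather than MediaPipe output objects.

```python
import math


def mean_landmark_speed(prev, curr, dt):
    """Average Euclidean speed of landmarks between two frames.

    prev, curr: equal-length lists of (x, y, z) tuples (assumed normalized
    coordinates); dt: time between the frames in seconds.
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    if not prev or len(prev) != len(curr):
        raise ValueError("frames must be non-empty and the same length")
    total = sum(math.dist(p, c) for p, c in zip(prev, curr))
    return total / (len(prev) * dt)


def speed_to_tempo(speed, min_bpm=60.0, max_bpm=180.0, max_speed=2.0):
    """Map a speed to a tempo with a clamped linear ramp.

    The ramp and its bounds are illustrative assumptions; the source does not
    specify how velocity modulates tempo.
    """
    frac = min(max(speed / max_speed, 0.0), 1.0)
    return min_bpm + frac * (max_bpm - min_bpm)


def window_sequences(notes, seq_len):
    """Split an encoded note stream into (input window, next note) pairs,
    a common way to prepare next-event training data for an LSTM."""
    return [
        (notes[i:i + seq_len], notes[i + seq_len])
        for i in range(len(notes) - seq_len)
    ]


if __name__ == "__main__":
    frame_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    frame_b = [(0.0, 0.1, 0.0), (1.0, 0.0, 0.0)]
    speed = mean_landmark_speed(frame_a, frame_b, dt=0.1)
    print(speed_to_tempo(speed))
    print(window_sequences([60, 62, 64, 65], seq_len=2))
```

In a live system the tempo value would presumably be smoothed over several frames before it is passed to the generator, since raw frame-to-frame speeds are noisy. That smoothing is omitted here to keep the sketch short.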