What question did this study set out to answer?

To develop an AI framework that generates piano music in real time from human body motion, linking gesture to musical output.

January 6, 2026 (Open Access)

AI-Powered Music Generation from Sequential Motion Signals: A Study in LSTM-Based Modelling

Key Points

Developed an AI framework that generates piano music in real time from human body motion, linking gesture to musical output
Integrated computer vision for motion capture with deep learning techniques
Used MediaPipe for human pose extraction and OpenCV for motion tracking
Employed a Long Short-Term Memory (LSTM) network trained on classical piano MIDI compositions
Implemented a data processing pipeline including MIDI encoding, segmentation, and feature normalization
Evaluated model performance through quantitative metrics and qualitative assessments
LSTM-based generator maintains structural stability and produces diverse musical sequences
Model adapts tempo and rhythmic intensity in real time based on motion velocity
System facilitates embodied musical interaction without requiring traditional instrumental expertise
Demonstrated coherent musical output that correlates with physical gestures
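
The idea of adapting tempo to motion velocity can be illustrated with a minimal sketch. The summary does not state how the study maps velocity to tempo, so everything below is an assumption made for illustration: the function names, the linear mapping, the BPM range, and the `max_speed` ceiling are hypothetical and are not taken from the paper.

```python
import math


def mean_landmark_speed(prev_frame, curr_frame, dt):
    """Average Euclidean speed of all landmarks between two frames.

    prev_frame, curr_frame: lists of (x, y, z) tuples, one per landmark.
    dt: time between the frames in seconds.
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    if not prev_frame or len(prev_frame) != len(curr_frame):
        raise ValueError("frames must be non-empty and the same length")
    distances = [math.dist(p, c) for p, c in zip(prev_frame, curr_frame)]
    return sum(distances) / len(distances) / dt


def speed_to_tempo(speed, min_bpm=60.0, max_bpm=180.0, max_speed=2.0):
    """Map a speed to a tempo in BPM with a clamped linear ramp.

    The BPM range and max_speed are illustrative defaults (assumptions).
    """
    ratio = min(max(speed / max_speed, 0.0), 1.0)
    return min_bpm + ratio * (max_bpm - min_bpm)


if __name__ == "__main__":
    # Two hypothetical frames with two landmarks each, 0.5 s apart.
    previous = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
    current = [(0.3, 0.4, 0.0), (1.0, 1.0, 1.0)]
    speed = mean_landmark_speed(previous, current, dt=0.5)
    print(f"speed={speed:.3f}, tempo={speed_to_tempo(speed):.1f} BPM")
```

A clamped linear ramp is only one possible design; a real system might smooth the speed over several frames before mapping it, to avoid abrupt tempo jumps.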

Abstract

This study presents an interactive AI-driven framework for real-time piano music generation from human body motion, establishing a coherent link between physical gesture and computational creativity. The proposed system integrates computer vision–based motion capture with sequence-oriented deep learning to translate continuous movement dynamics into structured musical output. Human pose is extracted using MediaPipe, while OpenCV is employed for temporal motion tracking to derive three-dimensional skeletal landmarks and velocity-based features that modulate musical expression. These motion-derived signals condition a Long Short-Term Memory (LSTM) network trained on a large corpus of classical piano MIDI compositions, enabling the model to preserve stylistic coherence and long-range musical dependencies while dynamically adapting tempo and rhythmic intensity in response to real-time performer movement. The data processing pipeline includes MIDI event encoding, sequence segmentation, feature normalization, and multi-layer LSTM training optimized using cross-entropy loss and the RMSprop optimizer. Model performance is evaluated quantitatively through loss convergence and note diversity metrics, and qualitatively through assessments of musical coherence and system responsiveness. Experimental results demonstrate that the proposed LSTM-based generator maintains structural stability while producing diverse and expressive musical sequences that closely reflect variations in motion velocity. By establishing a closed-loop, real-time mapping between gesture and sound, the framework enables intuitive, embodied musical interaction without requiring traditional instrumental expertise, advancing embodied AI and multimodal human–computer interaction while opening new opportunities for digital performance, creative education, and accessible music generation through movement.
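
The sequence segmentation and feature normalization steps named in the abstract can be sketched as follows. The abstract does not give the encoding details, so this is a hedged illustration of one common sliding-window approach rather than the authors' pipeline: the note-name tokens, the integer vocabulary, the window length, and the function names are all assumptions.

```python
def build_vocab(notes):
    """Assign each distinct note token a stable integer id (sorted order)."""
    return {token: index for index, token in enumerate(sorted(set(notes)))}


def segment(notes, vocab, seq_len):
    """Cut a note sequence into fixed-length input windows and next-note targets.

    Each input window holds seq_len note ids scaled into [0, 1) by the
    vocabulary size; each target is the integer id of the note that follows.
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    ids = [vocab[token] for token in notes]
    scale = float(len(vocab))
    inputs, targets = [], []
    for start in range(len(ids) - seq_len):
        window = ids[start:start + seq_len]
        inputs.append([note_id / scale for note_id in window])
        targets.append(ids[start + seq_len])
    return inputs, targets


if __name__ == "__main__":
    # Hypothetical note tokens standing in for encoded MIDI events.
    melody = ["C4", "E4", "G4", "C4", "E4"]
    vocab = build_vocab(melody)
    windows, next_notes = segment(melody, vocab, seq_len=3)
    for window, target in zip(windows, next_notes):
        print(window, "->", target)
```

In a setup like this, the integer targets would typically be one-hot encoded or passed to a sparse cross-entropy loss when training the LSTM.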

Cite This Study

Bukaita et al. studied this question.

synapsesocial.com/papers/695d855e3483e917927a4bd2 https://doi.org/10.11648/j.ijiis.20251406.12
