Acquiring piano performance skills depends on continuous practice and precise feedback, yet traditional manual evaluation is limited by time cost and subjective variation, which makes it hard to meet the demands of large-scale music education. This study proposes a self-supervised multimodal Transformer framework whose core contribution is the adaptive fusion, through cross-modal attention, of three heterogeneous representations: audio spectral features, symbolic MIDI representations, and a MIDI-derived spatial/kinematic proxy. Because the MAESTRO dataset contains no video recordings, the hand posture features are synthesized from MIDI parameters rather than captured by independent visual sensors; they therefore serve as a kinematic proxy for validating the multimodal fusion concept under controlled conditions. Training proceeds in two stages. Pretraining combines contrastive learning, masked prediction, and temporal reconstruction objectives to learn general-purpose music representations. Fine-tuning then optimizes fine-grained detection of five error categories: pitch, timing, dynamics, touch, and pedal. This two-stage design reduces dependence on large-scale annotated data. Experiments on the public MAESTRO dataset show that multimodal fusion outperforms unimodal approaches and that self-supervised pretraining generalizes better when annotations are limited. Difficulty-level comparison experiments confirmed the model’s robustness in complex performance contexts.
These findings concern fusion of audio and symbolic MIDI with a MIDI-derived spatial/kinematic proxy under controlled conditions. They do not imply that video-based hand pose observation would necessarily yield similar gains; testing that remains future work.
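To make the fusion idea concrete, the following is a minimal sketch of cross-modal attention in plain Python, not the paper's implementation. It assumes that audio frames act as queries over the concatenated MIDI and proxy tokens, that all three modalities share one embedding width, and that the attended context is added back as a residual. The function names (`cross_attention`, `fuse`) and the toy vectors are hypothetical, and the learned projection matrices a real Transformer layer would use are omitted.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key/value rows."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value rows, one output channel at a time.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights


def fuse(audio, midi, proxy):
    """Assumed fusion scheme: audio queries attend over MIDI + proxy tokens,
    and the attended context is added to the audio features (residual)."""
    context = midi + proxy  # list concatenation along the token axis
    attended, weights = cross_attention(audio, context, context)
    fused = [[a + b for a, b in zip(row_a, row_c)]
             for row_a, row_c in zip(audio, attended)]
    return fused, weights


# Toy embeddings (hypothetical values): 2 audio frames, 2 MIDI tokens, 1 proxy token.
audio = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
midi = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
proxy = [[0.5, 0.5, 1.0]]
fused, weights = fuse(audio, midi, proxy)
```

Each row of `weights` is a distribution over the three context tokens, so inspecting it shows how much each audio frame draws on the MIDI tokens versus the proxy token. Because the proxy is itself derived from MIDI, high weight on it does not indicate independent visual evidence.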
Yang et al. (Sun,) studied this question.