What question did this study set out to answer?

This study aims to improve the evaluation of piano performance skills by using a self-supervised multimodal framework that can detect multiple types of performance errors.

June 10, 2026 | Open Access

Self-supervised multimodal transformer for fine-grained detection of controlled perturbation events in piano performance

Key Points

Proposed a self-supervised multimodal transformer framework using audio, MIDI, and spatial features.
Employed contrastive learning, masked prediction, and temporal reconstruction in a two-stage training strategy.
Utilized the MAESTRO dataset for training, focusing on five error categories during fine-tuning.
Multimodal fusion significantly outperformed unimodal approaches in detecting performance errors.
Demonstrated stronger generalization capabilities with self-supervised pretraining under limited annotations.
Confirmed model robustness across varying difficulty levels in complex performance scenarios.
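
The three pretraining objectives named in the key points can be illustrated with a small sketch. The summary does not say how the objectives are combined, so everything here is an assumption for illustration: an InfoNCE-style contrastive term between audio and MIDI embeddings, mean squared error for masked prediction and temporal reconstruction, and a plain weighted sum. The function names and the `weights` parameter are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: anchors[i] should match positives[i] among all positives."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def mse(pred, target):
    """Mean squared error over all positions."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def masked_mse(pred, target, mask):
    """Mean squared error over masked positions only (mask value is truthy)."""
    picked = [(p, t) for p, t, m in zip(pred, target, mask) if m]
    return sum((p - t) ** 2 for p, t in picked) / len(picked)


def pretraining_loss(audio_emb, midi_emb, pred, target, mask,
                     recon, original, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three objectives (the weighting is an assumption)."""
    w_con, w_mask, w_rec = weights
    return (w_con * info_nce(audio_emb, midi_emb)
            + w_mask * masked_mse(pred, target, mask)
            + w_rec * mse(recon, original))
```

In this sketch the contrastive term is small when each audio embedding is most similar to the MIDI embedding of the same segment, and large when the pairs are mismatched.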

Abstract

Acquiring piano performance skills depends on continuous practice and precise feedback, yet manual evaluation is time-consuming and subjective, which makes it hard to scale to large music-education settings. This study proposes a self-supervised multimodal Transformer framework that uses cross-modal attention with adaptive fusion to combine audio spectral features, symbolic MIDI representations, and a MIDI-derived spatial/kinematic proxy. Because the MAESTRO dataset contains no video recordings, the hand-posture features are derived synthetically from MIDI parameters rather than captured by independent visual sensors; they serve as a kinematic proxy for validating multimodal fusion under controlled conditions. The two-stage training strategy uses contrastive learning, masked prediction, and temporal reconstruction objectives during pretraining to learn general-purpose music representations, then fine-tunes for fine-grained detection of five error categories (pitch, timing, dynamics, touch, and pedal), significantly reducing dependence on large-scale annotated data. Experiments on the public MAESTRO dataset show that multimodal fusion substantially outperforms unimodal approaches and that self-supervised pretraining generalizes better when annotations are limited. Difficulty-level comparison experiments confirm the model’s robustness in complex performance contexts.
The core contribution is the demonstration that cross-modal attention can fuse heterogeneous representations (audio, symbolic MIDI, and a MIDI-derived spatial/kinematic proxy) under controlled conditions. These findings do not imply that video-based hand-pose observation would necessarily yield similar gains; testing that remains future work.
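
The title refers to controlled perturbation events, and the abstract names the error categories. As a minimal sketch of how a labeled perturbation could be injected into a clean note sequence, the following assumes a simplified `Note` record and illustrative perturbation magnitudes; none of the names or ranges come from the study. Pedal perturbations are left out because the sustain pedal is carried by MIDI control-change events rather than by notes.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Note:
    pitch: int       # MIDI note number, 0-127
    onset: float     # seconds
    duration: float  # seconds
    velocity: int    # MIDI velocity, 1-127


def perturb(notes, index, kind, rng):
    """Return (new_notes, label) with one perturbation of `kind` at `index`.

    The magnitudes below are hypothetical placeholders, not values from the study.
    """
    note = notes[index]
    sign = rng.choice([-1, 1])
    if kind == "pitch":
        # Shift by one or two semitones, clamped to the MIDI range.
        shift = sign * rng.choice([1, 2])
        changed = replace(note, pitch=min(127, max(0, note.pitch + shift)))
    elif kind == "timing":
        # Move the onset earlier or later by 50 to 150 ms.
        offset = sign * rng.uniform(0.05, 0.15)
        changed = replace(note, onset=max(0.0, note.onset + offset))
    elif kind == "dynamics":
        # Make the note clearly louder or softer.
        delta = sign * rng.randint(20, 40)
        changed = replace(note, velocity=min(127, max(1, note.velocity + delta)))
    elif kind == "touch":
        # Shorten or lengthen the note to mimic a change in articulation.
        changed = replace(note, duration=note.duration * rng.choice([0.4, 1.8]))
    else:
        raise ValueError(f"unsupported perturbation kind: {kind}")
    new_notes = list(notes)  # copy so the clean sequence is left untouched
    new_notes[index] = changed
    return new_notes, {"index": index, "kind": kind}
```

Because the clean sequence is left unmodified, each perturbed copy comes with an exact label for where and how it differs, which is the property that makes such events usable as fine-tuning targets.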
