
May 24, 2026 | Open Access

Transformer-based real-time automatic error annotation for piano performance

Key Points

Aimed to improve real-time automatic error detection in piano performance, where separating audio-score alignment from error identification tends to propagate inaccuracies.
Introduced the DiffAlign-Transformer framework for error detection and alignment.
Used a hierarchical cross-modal encoder to jointly learn note alignment and error classification.
Evaluated on the Vienna Synchronous Library dataset with a leave-one-performer-out validation strategy.
Achieved an overall F1-score of 0.872, exceeding the strongest baseline by 6.0%.
Improved onset error recognition by 7.2% and offset error recognition by 8.1%.
Required only 78 milliseconds per second of audio for inference, meeting real-time requirements.
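
Two of the points above lend themselves to a small illustration: the leave-one-performer-out protocol and the real-time figure. The sketch below is not the authors' evaluation code. The function names and the toy performer labels are hypothetical, and it assumes only that each recording carries a performer identifier. It shows how each performer is held out exactly once, and how 78 milliseconds of processing per second of audio translates into a real-time factor.

```python
from collections import defaultdict


def leave_one_performer_out(performer_ids):
    """Yield (held_out, train_idx, test_idx) once per distinct performer.

    performer_ids[i] is the (hypothetical) performer label of recording i.
    """
    by_performer = defaultdict(list)
    for idx, pid in enumerate(performer_ids):
        by_performer[pid].append(idx)

    for held_out in sorted(by_performer):
        test_idx = by_performer[held_out]
        # Every recording by any other performer goes into the training fold.
        train_idx = sorted(
            i
            for pid, idxs in by_performer.items()
            if pid != held_out
            for i in idxs
        )
        yield held_out, train_idx, test_idx


def real_time_factor(processing_ms_per_audio_second):
    """Processing time divided by audio duration; below 1.0 keeps up with live input."""
    return processing_ms_per_audio_second / 1000.0


# Toy labels, purely illustrative.
folds = list(leave_one_performer_out(["a", "b", "a", "c"]))
rtf = real_time_factor(78.0)   # figure reported in the key points
speedup = 1.0 / rtf            # how many times faster than real time
```

With the reported 78 ms per second of audio, the real-time factor is 0.078, so processing runs roughly 12.8 times faster than the audio plays. Holding out a whole performer, rather than random recordings, keeps one player's habits from appearing in both the training and test folds.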

Abstract

This research tackles the challenge of real-time automatic error detection in piano performance, a task where conventional approaches often propagate inaccuracies because audio-score alignment and error identification are decoupled. This paper introduces the DiffAlign-Transformer framework, which incorporates a differentiable dynamic programming mechanism to jointly learn probabilistic note-level alignment and error classification within a hierarchical cross-modal encoder. Evaluated on the Vienna Synchronous Library dataset using a leave-one-performer-out validation strategy, the model attains an overall F1-score of 0.872, exceeding the strongest baseline by 6.0%, with marked gains in onset (7.2%) and offset (8.1%) error recognition. Inference requires only 78 milliseconds per second of audio, satisfying strict real-time constraints. These outcomes indicate that the method resolves the intertwined alignment-detection problem and delivers precise, immediate feedback for piano pedagogy.
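
The abstract does not say which differentiable dynamic programming formulation the framework uses. As a rough illustration of the general idea, the sketch below swaps in a soft-DTW-style recursion, which replaces the hard minimum of classic dynamic time warping with a smooth soft-minimum so the alignment cost varies smoothly with its inputs. The function names, the `gamma` smoothing parameter, and the toy cost matrix are assumptions made for illustration, not details from the paper.

```python
import math


def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    m = min(values)
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in values))


def soft_alignment_cost(cost, gamma=1.0):
    """Soft-DTW-style alignment cost for a (performed x score) cost matrix.

    cost[i][j] is a hypothetical mismatch between performed event i and
    score note j. As gamma shrinks, the result approaches the hard DTW cost.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    # acc[i][j] holds the soft cost of aligning the first i rows with the first j columns.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            previous = [acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1]]
            acc[i][j] = cost[i - 1][j - 1] + soft_min(previous, gamma)
    return acc[n][m]


# Toy 2x2 cost matrix: the diagonal path matches perfectly.
toy_cost = [[0.0, 1.0], [1.0, 0.0]]
nearly_hard = soft_alignment_cost(toy_cost, gamma=0.01)
smoothed = soft_alignment_cost(toy_cost, gamma=1.0)
```

Because the soft-minimum blends all three predecessor cells instead of picking one, every candidate path contributes to the result. Written in an automatic differentiation framework, the same recursion would pass gradients back to the cost matrix, which is what allows an alignment step to be trained jointly with a classifier.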
