In practical automatic music transcription, it is essential to detect the multiple pitches played by each of multiple instruments within a music signal. This task is addressed by multi-instrument multi-pitch estimation (MI-MPE). Our approach is to construct a deep learning model for MI-MPE that leverages the outputs of two models: the MPE model, which performs instrument-agnostic multi-pitch estimation, and the IR model, which performs pitch-agnostic frame-level instrument recognition. First, as a preliminary experiment, we developed a baseline MI-MPE model that does not incorporate MPE or IR outputs and compared its performance with that of the MPE and IR models. All three models share similar architectures consisting of convolutional layers and Transformer blocks. For comparison, the MI-MPE outputs are projected into the MPE or IR format by applying a max operation along the instrument or pitch dimension, respectively. Experimental results showed that the projected outputs of the MI-MPE model exhibit comparable or higher precision, but significantly lower recall and lower average precision, than the MPE and IR models. These results suggest that the baseline MI-MPE model has difficulty detecting certain pitches and instruments. To address this, we constructed a new MI-MPE model that takes the outputs of the MPE and IR models as auxiliary inputs, and we evaluated its performance.
Ogura et al. (Wed,) studied this question.
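The projection used for the comparison can be illustrated with a minimal sketch. It assumes the MI-MPE output is an array of per-frame activations with shape (frames, instruments, pitches); that axis order, the function names `project_to_mpe` and `project_to_ir`, and the toy values are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def project_to_mpe(mi_mpe: np.ndarray) -> np.ndarray:
    """Max over the instrument axis: (frames, instruments, pitches) -> (frames, pitches)."""
    return mi_mpe.max(axis=1)


def project_to_ir(mi_mpe: np.ndarray) -> np.ndarray:
    """Max over the pitch axis: (frames, instruments, pitches) -> (frames, instruments)."""
    return mi_mpe.max(axis=2)


# Toy activations (assumed layout): 2 frames, 2 instruments, 3 pitches.
mi_mpe = np.array([
    [[0.9, 0.1, 0.0],   # frame 0, instrument 0
     [0.2, 0.8, 0.0]],  # frame 0, instrument 1
    [[0.0, 0.0, 0.3],   # frame 1, instrument 0
     [0.0, 0.7, 0.1]],  # frame 1, instrument 1
])

mpe = project_to_mpe(mi_mpe)  # instrument-agnostic pitch activations
ir = project_to_ir(mi_mpe)    # pitch-agnostic instrument activations
```

Under this layout, a pitch counts as active in the MPE format if any instrument plays it, and an instrument counts as active in the IR format if it plays any pitch, which is what lets the projected outputs be scored against the dedicated MPE and IR models.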