Abstract

Musical intervals correspond to ratios between two fundamental frequencies, which makes their automatic recognition dependent on representations that preserve this relationship while remaining robust to variations in instrument timbre and recording conditions. To investigate which representation best supports this requirement, we compare three time–frequency representations: the mel spectrogram, the constant-Q transform (CQT), and the harmonic CQT (HCQT). The comparison uses a single convolutional architecture and instrument-held-out splits on a synthetic multi-instrument dataset for twelve-class harmonic interval recognition. The HCQT, which stacks the first five harmonic layers of the CQT, consistently outperforms the mel spectrogram and the CQT and achieves near-perfect accuracy on unseen instruments. Transfer learning from large-scale pretrained audio models, including PANNs and the Audio Spectrogram Transformer, does not surpass the HCQT model trained from scratch on the same data. A model trained on synthetic data transfers to real microphone recordings with minimal loss of performance: zero-shot accuracy on a held-out mid-field recorded test set reaches approximately 99.6%, and fine-tuning on a small amount of near-field recorded data increases accuracy to about 99.9%. An interpretability analysis using Grad-CAM and embedding visualisation shows that the HCQT-based model focuses on time–frequency regions corresponding to the two fundamentals and their harmonics. The embedding space is organised by interval class, and synthetic and recorded samples of the same class occupy similar regions. By contrast, a baseline based on a single-frame FFT feature vector and a multilayer perceptron, trained on both synthetic and microphone recordings, achieves at best about 85% accuracy on the recorded test set. The baseline exhibits systematic confusions between neighbouring intervals and between inversion pairs, while the tritone is recognised most reliably because it is its own inversion. These results suggest that representations that explicitly encode logarithmic frequency structure and harmonic relationships are well suited to interval recognition and to transfer from synthetic to recorded audio, and that the limitations of the FFT baseline are musically interpretable.
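The role of logarithmic frequency structure can be illustrated with a small numerical sketch. It is not the authors' pipeline: the bin resolution (`BINS_PER_OCTAVE`), the lowest frequency (`F_MIN`), the assumption of equal temperament, and all function names are illustrative choices. It shows that a fixed interval becomes a constant bin offset on a log-frequency axis regardless of the root note, that the h-th harmonic of a fundamental lands on the same bin index in an HCQT layer whose axis starts at h times the lowest frequency, and that the tritone maps to itself under inversion.

```python
import math

BINS_PER_OCTAVE = 36          # assumed resolution: 3 bins per semitone
F_MIN = 32.70                 # assumed lowest analysis frequency in Hz (about C1)
HARMONICS = (1, 2, 3, 4, 5)   # first five harmonic layers, as stated in the abstract


def interval_ratio(semitones):
    """Equal-tempered frequency ratio for an interval of `semitones` semitones."""
    return 2.0 ** (semitones / 12.0)


def log_bin(freq_hz, f_min=F_MIN, bins_per_octave=BINS_PER_OCTAVE):
    """Fractional bin position of a frequency on a log-frequency (CQT-like) axis."""
    return bins_per_octave * math.log2(freq_hz / f_min)


def hcqt_layer_bin(f0_hz, harmonic):
    """Bin of the `harmonic`-th partial of f0 in the layer whose axis starts at harmonic * F_MIN."""
    return log_bin(harmonic * f0_hz, f_min=harmonic * F_MIN)


def inversion(semitones):
    """Interval (in semitones) that completes `semitones` to an octave."""
    return 12 - semitones


if __name__ == "__main__":
    fifth = 7  # perfect fifth, in semitones
    for root in (110.0, 220.0, 330.0):
        upper = root * interval_ratio(fifth)
        log_offset = log_bin(upper) - log_bin(root)   # same for every root
        linear_offset = upper - root                  # grows with the root
        print(f"root={root:6.1f} Hz  log offset={log_offset:.2f} bins  "
              f"linear offset={linear_offset:.2f} Hz")

    # Each harmonic of 220 Hz sits at the same bin index in its own layer.
    print([round(hcqt_layer_bin(220.0, h), 6) for h in HARMONICS])

    # Inversion pairs among the twelve classes; 6 (the tritone) pairs with itself.
    print({n: inversion(n) for n in range(1, 12)})
```

On a linear FFT axis the same interval spans a different number of hertz for each root, which gives one plausible reading of why a single-frame FFT feature vector is a harder input for this task than a log-frequency representation.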
Shanurina et al. (Sun,) studied this question.