What question did this study set out to answer?

This research aims to identify the best representation for recognizing musical intervals based on frequency relationships and robustness to variations.

June 3, 2026 | Open Access

Feature engineering and deep learning for musical interval recognition through spectral analysis

Key Points

Compared three time-frequency representations: mel spectrogram, constant-Q transform (CQT), and harmonic CQT (HCQT).
Used a single convolutional architecture with instrument-held-out splits on a synthetic multi-instrument dataset for twelve-class harmonic interval recognition.
Applied transfer learning from pretrained audio models and performed fine-tuning on near-field recordings.
The HCQT model reached approximately 99.6% zero-shot accuracy on a held-out mid-field recorded test set; fine-tuning on a small amount of near-field recorded data raised accuracy to about 99.9%.
Transfer learning from large-scale pretrained models did not outperform the HCQT trained from scratch.
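A baseline using a single-frame FFT feature vector and a multilayer perceptron achieved at best about 85% accuracy on the recorded test set, with systematic confusions between neighbouring intervals and between inversion pairs.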
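
To make the HCQT named in the key points more concrete, the following is a minimal sketch of how its frequency grid might be laid out: one constant-Q grid per harmonic, each starting at a multiple of a base frequency, so that the same bin index in every layer lines up with a harmonic of the same pitch. The function names, the base frequency of 32.70 Hz, the 12 bins per octave, the 60 bins, and the choice of harmonics 1 to 5 are illustrative assumptions, not settings reported by the study.

```python
import math

def cqt_center_freqs(f_min, n_bins, bins_per_octave=12):
    """Geometrically spaced centre frequencies of a constant-Q transform."""
    return [f_min * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]

def hcqt_center_freqs(f_min, n_bins, harmonics=(1, 2, 3, 4, 5), bins_per_octave=12):
    """One CQT frequency grid per harmonic h, each starting at h * f_min."""
    return {h: cqt_center_freqs(h * f_min, n_bins, bins_per_octave) for h in harmonics}

# Hypothetical settings: roughly C1 (32.70 Hz) as the lowest bin, five octaves.
layers = hcqt_center_freqs(f_min=32.70, n_bins=60)

k = 33  # bin index; 33 semitones above the base, close to A3
for h, freqs in layers.items():
    # At a fixed bin index, layer h is centred on h times the layer-1 frequency.
    print(f"h={h}: bin {k} is centred at {freqs[k]:.1f} Hz")
```

Because the layers share bin indices, a convolutional filter that spans the harmonic axis can see a fundamental and its overtones at the same position, which is one plausible reading of why the stacked representation helps.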

Abstract

Musical intervals correspond to ratios between two fundamental frequencies, which makes their automatic recognition dependent on representations that preserve this relationship while remaining robust to variations in instrument timbre and recording conditions. To investigate which representation best supports this requirement, we compare three time–frequency representations: the mel spectrogram, the constant-Q transform (CQT), and the harmonic CQT (HCQT). The comparison uses a single convolutional architecture and instrument-held-out splits on a synthetic multi-instrument dataset for twelve-class harmonic interval recognition. The HCQT, which stacks the first five harmonic layers of the CQT, consistently outperforms mel and CQT and achieves near-perfect accuracy on unseen instruments. Transfer learning from large-scale pretrained audio models, including PANNs and the Audio Spectrogram Transformer, does not surpass the HCQT model trained from scratch on the same data. A model trained on synthetic data transfers successfully to real microphone recordings with minimal loss of performance. Zero-shot accuracy on a held-out mid-field recorded test set reaches approximately 99.6%, and fine-tuning on a small amount of near-field recorded data increases accuracy to about 99.9%. An interpretability analysis using Grad-CAM and embedding visualisation shows that the HCQT-based model focuses on time–frequency regions corresponding to the two fundamentals and their harmonics. The embedding space is organised by interval class, while synthetic and recorded samples of the same class occupy similar regions. By contrast, a baseline based on a single-frame FFT feature vector and a multilayer perceptron, trained on both synthetic and microphone recordings, achieves at best about 85% accuracy on the recorded test set. The baseline exhibits systematic confusions between neighbouring intervals and between inversion pairs, while the tritone is recognised most reliably because it is its own inversion. These results suggest that representations that explicitly encode logarithmic frequency structure and harmonic relationships are well suited to interval recognition and to transfer from synthetic to recorded audio, and that the limitations of the FFT baseline are musically interpretable.
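
The link between frequency ratios, interval classes, and inversion pairs described in the abstract can be illustrated with a short sketch. It assumes twelve-tone equal temperament and a labelling of the twelve classes as 1 to 12 semitones (minor second through octave); the abstract does not state the class definitions, so both the labelling and the helper names are hypothetical. On a logarithmic frequency axis the interval depends only on the ratio of the two fundamentals, not on the root pitch.

```python
import math

# Assumed labelling: classes 1..12 in semitones, minor second through octave.
INTERVAL_NAMES = {
    1: "minor second", 2: "major second", 3: "minor third", 4: "major third",
    5: "perfect fourth", 6: "tritone", 7: "perfect fifth", 8: "minor sixth",
    9: "major sixth", 10: "minor seventh", 11: "major seventh", 12: "octave",
}

def interval_semitones(f_low, f_high):
    """Nearest equal-tempered interval, in semitones, between two fundamentals."""
    return round(12 * math.log2(f_high / f_low))

def inversion(semitones):
    """Interval that completes the octave; defined here for 1..11 semitones."""
    if not 1 <= semitones <= 11:
        raise ValueError("inversion is only defined here for 1..11 semitones")
    return 12 - semitones

fifth = interval_semitones(220.0, 330.0)  # frequency ratio 3:2
print(INTERVAL_NAMES[fifth], "inverts to", INTERVAL_NAMES[inversion(fifth)])
print("tritone inverts to", INTERVAL_NAMES[inversion(6)])
```

Every interval from 1 to 11 semitones except the tritone maps to a different class under inversion, which matches the abstract's remark that the tritone is its own inversion.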

Cite This Study

Shanurina et al. studied this question.

synapsesocial.com/papers/6a1fc58bdee9eb8c0dce6ebb https://doi.org/10.1007/s44163-026-01499-3
