This study investigates perceptual similarity at two levels: music tracks (track-level) and the individual instrumental parts that compose them (part-level). Previous work studied perceptual part-level similarity with the aim of developing a model that estimates it. An ABX-style listening test with 632 participants was conducted to evaluate similarity at both levels from four perspectives: timbre, melody, rhythm, and overall. Although the previous work drew some knowledge from these evaluations, further insights are needed to support the development of future estimation models. Specifically, important questions remain regarding the correspondence between track-level and part-level similarity, the generalizability of findings across multiple models, and the validity of the conventional learning method with respect to perceptual similarity. This study revealed the following key findings: (1) the instrumental parts that predominantly affect track-level similarity differ across music triplets and listeners, with differences across music triplets having a greater influence than differences across listeners, indicating that part-level similarity helps in estimating track-level similarity; (2) when temporal averaging is applied, the output of the deep learning models corresponds more closely to the perceptual evaluation based on timbre than to that based on rhythm, indicating a potential area for improvement in the models; (3) the similarity between temporally distinct segments within the same music track is perceived to be significantly higher than that between segments from different tracks, which supports the assumption of the conventional unsupervised learning method developed for music similarity estimation.
Hashizume et al. studied this question.
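To make finding (2) more concrete, the following minimal sketch shows one way temporal averaging might be applied to a model's frame-level output before comparing the model's ABX choice with listeners' choices. It is an illustration under stated assumptions, not the procedure used in the study: the embedding shape `(num_frames, dim)`, the use of cosine similarity, and the function names `temporal_average`, `abx_choice`, and `agreement_rate` are all hypothetical.

```python
import numpy as np


def temporal_average(frames: np.ndarray) -> np.ndarray:
    """Collapse a (num_frames, dim) array of frame embeddings into one (dim,) vector."""
    return frames.mean(axis=0)


def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two non-zero vectors (assumed similarity measure)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def abx_choice(x: np.ndarray, a: np.ndarray, b: np.ndarray) -> str:
    """Return 'A' if X is at least as similar to A as to B after averaging, else 'B'."""
    ex, ea, eb = (temporal_average(f) for f in (x, a, b))
    return "A" if cosine_similarity(ex, ea) >= cosine_similarity(ex, eb) else "B"


def agreement_rate(model_choices, listener_choices) -> float:
    """Fraction of triplets where the model's choice matches the listeners' choice."""
    matches = [m == l for m, l in zip(model_choices, listener_choices)]
    return sum(matches) / len(matches)


if __name__ == "__main__":
    # Toy frame embeddings (2 frames, 2 dimensions); values are made up.
    x = np.array([[1.0, 0.0], [0.8, 0.2]])
    a = np.array([[1.0, 0.2], [0.9, 0.0]])
    b = np.array([[0.0, 1.0], [0.2, 0.8]])
    print(abx_choice(x, a, b))
```

Computing `agreement_rate` separately for timbre-based and rhythm-based listener responses would give one simple way to compare how closely a model tracks each perspective. Note that mean pooling discards frame order, so a representation built this way carries little information about temporal structure.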