In this study, we investigate perceptual music similarity when listeners focus on individual instrumental parts, and we clarify how these part-based similarities relate to track-level similarity and to the predictions of deep learning models. We conducted a large-scale listening test in which 632 participants evaluated perceptual music similarity through an ABX test. Perceptual music similarity was evaluated from four perspectives: timbre, rhythm, melody, and overall. Our analysis revealed four main findings: (1) the relative perceptual music similarity among three tracks (i.e., the similarity comparison between an X–A pair and an X–B pair) varies depending on the instrumental part that listeners focus on; (2) the instrumental parts that predominantly affect perceptual music similarity differ across triplets of tracks; (3) rhythm and melody tend to affect perceptual music similarity for each instrumental part more than timbre does; and (4) similarity features extracted from a music segment using either our previously developed deep embedding method or a large pre-trained model (MERT) tend to capture timbre and melody rather than rhythm.

This work was partly supported by JST CREST JPMJCR19A3, JST AIP Acceleration Research JPMJCR25U5, and a Grant-in-Aid for JSPS Fellows JP24KJ1253, Japan.
Hashizume et al. studied this question.
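
As an illustration of how a model's similarity features could be compared with ABX judgments, the following is a minimal sketch, not the authors' procedure. It assumes that each track in a triplet (X, A, B) is represented by a fixed-length embedding, that the model "chooses" whichever of A or B has the higher cosine similarity to X, and that listener responses are aggregated by majority vote. All function names, the triplet dictionary layout, and the toy data are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def model_choice(emb_x, emb_a, emb_b):
    """Return 'A' if X is at least as close to A as to B in embedding space, else 'B'."""
    sim_xa = cosine_similarity(emb_x, emb_a)
    sim_xb = cosine_similarity(emb_x, emb_b)
    return "A" if sim_xa >= sim_xb else "B"


def listener_majority(responses):
    """Majority answer among ABX responses ('A' or 'B'); returns None on a tie."""
    counts = Counter(responses)
    if counts["A"] == counts["B"]:
        return None
    return "A" if counts["A"] > counts["B"] else "B"


def agreement_rate(triplets):
    """Fraction of non-tied triplets where the model choice matches the listener majority."""
    matches = 0
    total = 0
    for t in triplets:
        majority = listener_majority(t["responses"])
        if majority is None:
            continue  # skip triplets without a clear majority
        total += 1
        if model_choice(t["x"], t["a"], t["b"]) == majority:
            matches += 1
    return matches / total if total else float("nan")


# Toy data (assumed for illustration only; not from the study).
toy_triplets = [
    {"x": [1.0, 0.0], "a": [0.9, 0.1], "b": [0.0, 1.0], "responses": ["A", "A", "B"]},
    {"x": [0.0, 1.0], "a": [1.0, 0.0], "b": [0.1, 0.9], "responses": ["A", "A", "B"]},
]

if __name__ == "__main__":
    print(agreement_rate(toy_triplets))
```

Under these assumptions, running the same comparison separately on responses collected for timbre, rhythm, melody, and overall similarity would indicate which perspective a given embedding tracks most closely.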