This study presents an instrument-wise music-similarity representation learning (InMSRL) framework that leverages music source separation and human preference data without requiring clean stems at inference. Conventional methods follow either a Cascade architecture, which separates sources before extracting per-instrument similarity features and thus propagates separation errors, or a Direct architecture, which attempts single-stage disentanglement but often fails to isolate target-instrument features. We enhance both paradigms. First, we introduce end-to-end fine-tuning (E2E-FT) for the Cascade architecture, jointly optimizing the separator and the similarity extractors with an auxiliary separation loss to mitigate error propagation. Second, we propose a multi-task Direct model that reconstructs individual stems from its own embeddings, thereby sharpening instrument-level disentanglement. Finally, we incorporate perception-aware fine-tuning (PAFT) using human preference data to improve perceptual similarity representation. Experimental evaluations show that (1) Cascade with E2E-FT significantly boosts objective InMSRL metrics, (2) the multi-task Direct variant further enhances disentanglement performance, (3) PAFT markedly elevates perceptual InMSRL performance, and (4) Cascade combined with E2E-FT and PAFT consistently outperforms all other models. Work supported by JST AIP Acceleration Research JPMJCR25U5, Japan.
Imamura et al. studied this question.
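The joint objective behind E2E-FT (a similarity loss plus an auxiliary separation loss) can be illustrated with a minimal sketch. The abstract does not specify the loss functions or their weighting, so everything here is an assumption made for illustration: a hinge triplet loss stands in for the similarity objective, a mean absolute error stands in for the separation objective, and the names `triplet_loss`, `separation_loss`, `joint_loss`, `sep_weight`, and `margin` are hypothetical.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss on Euclidean distances.

    Assumed stand-in for the similarity objective; the abstract does not
    state which similarity loss is used.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))


def separation_loss(estimated_stems, reference_stems):
    """Mean absolute error between estimated and reference stems.

    Assumed stand-in for the auxiliary separation loss.
    """
    return float(np.mean(np.abs(estimated_stems - reference_stems)))


def joint_loss(anchor, positive, negative,
               estimated_stems, reference_stems,
               sep_weight=0.1, margin=0.2):
    """Similarity loss plus a weighted auxiliary separation loss.

    sep_weight is a hypothetical hyperparameter balancing the two terms.
    """
    sim = triplet_loss(anchor, positive, negative, margin)
    sep = separation_loss(estimated_stems, reference_stems)
    return sim + sep_weight * sep


# Toy usage: one embedding triplet and one pair of (estimated, reference) stems.
anchor = np.array([[0.0, 0.0]])
positive = np.array([[0.0, 0.0]])
negative = np.array([[3.0, 4.0]])
estimated = np.zeros((1, 8))
reference = np.ones((1, 8))
loss = joint_loss(anchor, positive, negative, estimated, reference)
```

In this toy case the negative is far from the anchor, so the similarity term is zero and only the weighted separation term contributes. In an actual end-to-end setup both terms would be computed in an automatic-differentiation framework so that gradients reach the separator and the similarity extractors together.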