What question did this study set out to answer?

The study aims to develop a robust framework for instrument-wise music similarity representation learning (InMSRL) that leverages music source separation and human preference data.

May 14, 2026

Instrument-wise music similarity representation learning with source separation and human preference

Key Points

Developed an end-to-end fine-tuning approach for the Cascade architecture to reduce error propagation.
Introduced a multi-task Direct model that reconstructs individual stems from their embeddings.
Incorporated perception-aware fine-tuning using human preference data to enhance similarity representation.
Cascade with E2E-FT showed significant improvement in objective InMSRL metrics.
Multi-task Direct model further enhanced disentanglement performance.
Perception-aware fine-tuning led to marked improvements in perceptual InMSRL performance.

Abstract

This study presents an instrument-wise music-similarity representation learning (InMSRL) framework that leverages music source separation and human preference data without requiring clean stems at inference. Conventional methods follow either the Cascade architecture, which separates sources prior to per-instrument similarity extraction and thus propagates separation errors, or a Direct model that attempts single-stage disentanglement but often fails to isolate target instrument features. We enhance both paradigms. First, we introduce end-to-end fine-tuning (E2E-FT) for the Cascade architecture, jointly optimizing the separator and similarity extractors with an auxiliary separation loss to mitigate error propagation. Second, we propose a multi-task Direct model that reconstructs individual stems from its own embeddings, thereby sharpening instrument-level disentanglement. Finally, we incorporate perception-aware fine-tuning (PAFT) using human preference data to improve perceptual similarity representation. Experimental evaluations show that (1) Cascade with E2E-FT significantly boosts objective InMSRL metrics, (2) the multi-task Direct variant further enhances disentanglement performance, (3) PAFT markedly elevates perceptual InMSRL performance, and (4) the combined Cascade with E2E-FT and PAFT consistently outperforms all other models.

Work supported by JST AIP Acceleration Research JPMJCR25U5, Japan.
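The abstract describes E2E-FT as jointly optimizing a similarity objective together with an auxiliary separation loss, but it does not give the loss functions. The following is a minimal sketch of what such a joint objective could look like, assuming a hinge triplet loss on embedding distances for the similarity term and a mean absolute error for the separation term. These loss choices, the function names, and the `margin` and `sep_weight` parameters are all illustrative assumptions, not the authors' formulation.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss (assumed similarity objective).

    It is zero once the anchor is closer to the positive than to the
    negative by at least `margin`.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def separation_loss(estimated, reference):
    """Mean absolute error between an estimated stem and its reference
    (assumed form of the auxiliary separation loss)."""
    return sum(abs(e - r) for e, r in zip(estimated, reference)) / len(reference)


def joint_loss(triplet, estimated, reference, sep_weight=0.5, margin=0.2):
    """Similarity loss plus a weighted auxiliary separation loss.

    `sep_weight` is a hypothetical trade-off hyperparameter.
    """
    anchor, positive, negative = triplet
    return (triplet_loss(anchor, positive, negative, margin)
            + sep_weight * separation_loss(estimated, reference))
```

In a cascade setup, gradients of a combined loss like this would flow through both the similarity extractor and the separator, which is the mechanism the abstract credits with reducing error propagation. Under the same assumptions, a perception-aware stage could reuse the triplet term with triplets chosen by human listeners, though the abstract does not say how the preference data enters training.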
