April 17, 2026 | Open Access

Ontology-Guided Multimodal Framework for Explainable Music Similarity and Recommendation

Key Points

Aimed to build a framework that improves music similarity assessment and recommendation by using an ontology to explain results.
Developed a multimodal framework that combines audio, text, and metadata features.
Used neural audio and text encoders, mainly transformer-based, for feature extraction.
Evaluated the framework on several open music and audio datasets using content-based retrieval tasks.
Included an external baseline built on conventional music information retrieval (MIR) audio descriptors.
Achieved better precision and recall than audio-only and non-ontological variants.
Improved mean average precision and coverage of rare content.
Visualizations and case studies indicated that ontology-guided reranking makes results easier to interpret.

Abstract

Assessing music similarity in large catalogs is challenging because listeners perceive music differently and the relevant information is spread across audio, text, and metadata. This article introduces an ontology-guided multimodal framework that makes music similarity and recommendation more explainable. The framework combines learned features from audio, lyrics, and other text with structured metadata in a shared similarity space, then refines the ranking with a music ontology that captures relationships between songs, artists, genres, and moods. The design works with any encoder that produces fixed-size features. This study uses neural audio and text encoders, mainly transformer-based. This approach allows the system to handle different input types while staying reliable across datasets. The framework is tested on several open music and audio datasets using content-based retrieval tasks and standard ranking measures. In addition to Configurations C1–C4, the study includes an external content-based reference baseline built on conventional MIR audio descriptors. This baseline represents a signal-level retrieval approach that models complementary aspects of the audio signal, such as timbre, harmony, and spectral characteristics. It is evaluated under the same retrieval protocol as the main framework and provides a comparison point outside the proposed C1–C4 design. Compared to audio-only and non-ontological variants within the same framework, the proposed multimodal and ontology-guided configurations achieve better precision, recall, and mean average precision, and also cover more rare content. Visualizations and case studies show that combining different data types and using ontology-based reranking can improve performance and make results easier to interpret. This work lays the groundwork for explainable, cognitively informed music recommendation systems and points to future work on modeling user behavior over time and adapting to different cultures.
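
The pipeline described in the abstract (fuse per-modality features into a shared space, then rerank with ontology relationships) can be illustrated with a minimal sketch. The abstract does not specify the fusion or reranking formulas, so everything below is an assumption made for illustration: fusion by concatenating unit-normalized vectors, ontology agreement measured as Jaccard overlap of concept labels, a linear mix controlled by a hypothetical weight `alpha`, and all function names, field names, and toy data.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def fuse(audio, text, meta):
    """Assumed fusion: concatenate unit-normalized modality vectors, then renormalize."""
    return l2_normalize(l2_normalize(audio) + l2_normalize(text) + l2_normalize(meta))

def cosine(a, b):
    """Cosine similarity for vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))

def ontology_overlap(concepts_a, concepts_b):
    """Assumed ontology score: Jaccard overlap of concept labels (genre, mood, ...)."""
    union = concepts_a | concepts_b
    return len(concepts_a & concepts_b) / len(union) if union else 0.0

def rerank(query, candidates, alpha=0.7):
    """Mix embedding similarity with ontology overlap; keep shared concepts as the explanation."""
    results = []
    for track in candidates:
        shared = sorted(query["concepts"] & track["concepts"])
        score = (alpha * cosine(query["embedding"], track["embedding"])
                 + (1 - alpha) * ontology_overlap(query["concepts"], track["concepts"]))
        results.append((track["id"], score, shared))
    return sorted(results, key=lambda r: r[1], reverse=True)

# Toy data: fixed-size vectors stand in for encoder outputs (hypothetical values).
query = {"id": "q", "embedding": fuse([1, 0], [1, 0], [1, 0]),
         "concepts": {"genre:jazz", "mood:calm"}}
candidates = [
    {"id": "a", "embedding": fuse([1, 0], [1, 0], [1, 0]),
     "concepts": {"genre:rock"}},
    {"id": "b", "embedding": fuse([1, 1], [1, 0], [1, 0]),
     "concepts": {"genre:jazz", "mood:calm"}},
]

ranked = rerank(query, candidates)
for track_id, score, shared in ranked:
    print(track_id, round(score, 3), "shared concepts:", shared)
```

In this toy case, track `a` is the closest match by embedding alone, but track `b` moves to the top once ontology overlap is mixed in, and the shared concepts give a readable reason for the ranking. This is the kind of interpretability the abstract attributes to ontology-guided reranking, though the actual mechanism used in the study may differ.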
