September 10, 2025

Toward Disentangled and Controllable Deep Metric Learning With Human-Like Concept Decomposition

Key Points

CMN effectively disentangles visual concepts, enhancing the interpretability of image embeddings.
CMN outperforms existing state-of-the-art methods on image retrieval, the conventional DML application.
The cross-attention mechanism associates concept vectors with regional visual features, improving control over embeddings.
Because each embedding dimension corresponds to a specific concept, CMN supports more flexible and controllable applications of deep metric learning.

Abstract

Deep metric learning (DML) has made significant advances in learning discriminative embeddings for images and plays a crucial role in various vision tasks. However, existing methods typically rely on deep neural networks to extract holistic embeddings, which are difficult to disentangle and interpret. To address this issue, we take inspiration from human cognition, in which objects are decomposed into distinct concepts for better understanding. Specifically, we propose the concept metrics network (CMN) to achieve disentangled and controllable DML. CMN begins by initializing learnable concept vectors to represent various visual concepts. These vectors are then associated with regional visual features via a cross-attention mechanism, ensuring that each vector corresponds to specific visual properties. Finally, the concept values, determined by the presence of each concept in the image, form the output embedding. Comprehensive experiments demonstrate that CMN effectively disentangles visual concepts, with each embedding dimension corresponding to a specific concept. Our method not only outperforms existing state-of-the-art methods in the conventional DML application (i.e., image retrieval), but also enables more flexible and controllable applications. The code is available at https://github.com/shchen0001/CMN.
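
The pipeline described in the abstract (concept vectors, cross-attention over regional features, one concept value per embedding dimension) can be illustrated with a minimal sketch. This is not the authors' implementation; see the linked repository for that. Everything here is an assumption made for illustration: the function names `concept_embedding` and `masked_distance`, the scaled dot-product form of the attention, the use of a dot product between a concept vector and its attended feature as the "concept value", and the random stand-ins for learned concept vectors and backbone features.

```python
import math
import random

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def concept_embedding(concepts, regions):
    """Hypothetical sketch: concepts is K vectors of size D, regions is R vectors of size D.

    Returns a K-dimensional embedding (one value per concept) and the
    K x R attention weights linking each concept to image regions.
    """
    dim = len(concepts[0])
    scale = math.sqrt(dim)
    embedding, attention = [], []
    for c in concepts:
        # Cross-attention: the concept vector acts as the query over regions.
        weights = softmax([dot(c, r) / scale for r in regions])
        # Attention-weighted sum of the regional features.
        attended = [sum(w * r[d] for w, r in zip(weights, regions)) for d in range(dim)]
        # Assumed scoring rule: concept value = concept vector . attended feature.
        embedding.append(dot(c, attended))
        attention.append(weights)
    return embedding, attention

def masked_distance(a, b, keep):
    # Euclidean distance restricted to the selected concept dimensions.
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in keep))

random.seed(0)
K, R, D = 4, 9, 8  # concepts, regions (e.g. a 3x3 grid), feature size
concepts = [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]
regions = [[random.gauss(0, 1) for _ in range(D)] for _ in range(R)]
embedding, attention = concept_embedding(concepts, regions)
```

Because each output dimension is tied to one concept in this sketch, a comparison can be restricted to chosen concepts by passing their indices to `masked_distance`, which is one simple way to picture the controllability the abstract refers to.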
