Deep metric learning (DML) has shown significant advancements in learning discriminative embeddings for images, playing a crucial role in various vision tasks. However, existing methods typically rely on deep neural networks to extract holistic embeddings, which are challenging to disentangle and interpret. To address this issue, we take inspiration from human cognition, where objects are decomposed into distinct concepts for better understanding. Specifically, we propose the concept metrics network (CMNs) to achieve disentangled and controllable DML. CMN begins by initializing learnable concept vectors to represent various visual concepts. These vectors are then associated with regional visual features via cross-attention mechanism, ensuring each vector corresponds to specific visual properties. Finally, the concept values, determined by their presence in the image, form the output embedding. Comprehensive experiments demonstrate that CMN effectively disentangles visual concepts, with each embedding dimension corresponding to a specific concept. Our method not only outperforms existing state-of-the-art methods in conventional DML application (i.e., image retrieval), but also enables more flexible and controllable application. The code is available at https://github.com/shchen0001/CMN.
Chen et al. (Wed,) studied this question.