What question did this study set out to answer?

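The study set out to improve fashion attribute recognition accuracy by addressing two gaps in existing models: they overlook the semantic relationships between attributes and the spatial positional dependencies between them.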

March 25, 2026

Open Access

SLAR-Net: A Hierarchical Network with Spatial and Semantic Fusion for Fashion Attribute Recognition

Key Points

The study aims to improve fashion attribute recognition accuracy by addressing semantic and spatial shortcomings of existing models.
The authors developed SLAR-Net, a hierarchical multi-label classification network.
A lightweight backbone network with a custom attention mechanism extracts low-level image features.
An adjacency matrix encodes the relative spatial orientations of fashion attributes and is processed by a graph convolutional network.
A graph embedding matrix models attribute dependency relationships for high-level semantic feature learning.
A custom multi-head attention mechanism fuses the spatial and semantic features.
SLAR-Net outperformed existing state-of-the-art methods in recognition accuracy.
The results support the hierarchical architecture and fusion strategy as the source of improved feature interaction and recognition performance.
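
To make the graph convolution step in the key points concrete, the following is a minimal NumPy sketch of one standard graph convolutional layer applied to an attribute adjacency matrix. It is not SLAR-Net's implementation: this summary does not say how the adjacency matrix is built, so the 4-attribute symmetric binary matrix, the feature sizes, and the function names `normalize_adjacency` and `gcn_layer` are all assumptions chosen for illustration. The propagation rule is the widely used symmetric normalization with self-loops.

```python
import numpy as np

def normalize_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2, the symmetric normalization with self-loops."""
    a_hat = adj + np.eye(adj.shape[0])      # add self-loops
    deg = a_hat.sum(axis=1)                 # node degrees (always >= 1 here)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(adj_norm, features, weight):
    """One graph convolution: ReLU(A_norm @ H @ W)."""
    return np.maximum(adj_norm @ features @ weight, 0.0)

# Hypothetical graph over 4 attributes; an edge marks two attributes that are
# spatially related. The real SLAR-Net matrix may be weighted or directed.
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]], dtype=float)

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 8))   # one 8-dim feature vector per attribute (assumed size)
weight = rng.normal(size=(8, 16))    # layer weights (random stand-in for learned values)

adj_norm = normalize_adjacency(adj)
out = gcn_layer(adj_norm, features, weight)
print(out.shape)
```

Each output row mixes an attribute's own features with those of its neighbors in the graph, which is how a layer of this kind could propagate spatial context between attributes.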

Abstract

With the rapid growth of fashion e-commerce, fashion attribute recognition has emerged as a critical research area in computer vision. Existing methods face two primary problems: (1) they rely on multi-task models, which leads to complex network architectures; and (2) they overlook the semantic relationships and spatial positional dependencies between fashion attributes. To address these issues, this paper proposes SLAR-Net, a novel hierarchical multi-label classification network that fuses spatial and semantic information to improve recognition performance. SLAR-Net adopts a progressive, hierarchical architecture. First, a lightweight backbone network enhanced with a custom-designed attention mechanism extracts low-level image features. Second, an adjacency matrix is constructed to represent the relative spatial orientations of attributes, and a graph convolutional network uses it to model mid-level spatial positional features. Third, a graph embedding matrix captures attribute dependency relationships, and a neural network learns high-level semantic representations from it. Finally, a custom multi-head attention mechanism fuses the spatial and semantic features, enhancing feature interaction and improving recognition performance. Experimental results on fashion attribute and benchmark datasets demonstrate that SLAR-Net outperforms state-of-the-art methods in recognition accuracy, validating the effectiveness of the proposed hierarchical architecture and fusion strategy.
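
The fusion step described in the abstract can be illustrated with generic multi-head cross-attention. The sketch below is an assumption-laden illustration, not the paper's custom mechanism: it assumes semantic features act as queries and spatial features as keys and values, adds a residual connection, and uses random matrices in place of learned projections. The function name `multi_head_fusion` and all dimensions are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_fusion(semantic, spatial, w_q, w_k, w_v, w_o, num_heads):
    """Fuse two feature sets with scaled dot-product cross-attention.

    Queries come from `semantic`; keys and values come from `spatial`.
    Returns the fused features and the per-head attention weights.
    """
    n, d = semantic.shape
    head_dim = d // num_heads

    def split(x):
        # (rows, d) -> (heads, rows, head_dim)
        return x.reshape(x.shape[0], num_heads, head_dim).transpose(1, 0, 2)

    q = split(semantic @ w_q)
    k = split(spatial @ w_k)
    v = split(spatial @ w_v)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, n, m)
    attn = softmax(scores, axis=-1)
    heads = attn @ v                                        # (heads, n, head_dim)

    merged = heads.transpose(1, 0, 2).reshape(n, d)         # concatenate heads
    return merged @ w_o + semantic, attn                    # residual connection

rng = np.random.default_rng(0)
n_attr, dim, num_heads = 4, 16, 4                # assumed sizes
semantic = rng.normal(size=(n_attr, dim))        # stand-in for high-level semantic features
spatial = rng.normal(size=(n_attr, dim))         # stand-in for mid-level spatial features
w_q, w_k, w_v, w_o = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(4))

fused, attn = multi_head_fusion(semantic, spatial, w_q, w_k, w_v, w_o, num_heads)
print(fused.shape, attn.shape)
```

Using one feature set as queries and the other as keys and values lets each attribute's semantic representation weight the spatial features it attends to; swapping the roles, or attending in both directions, would be equally plausible designs under the description given here.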
