With the rapid growth of fashion e-commerce, fashion attribute recognition has emerged as a critical research area in computer vision. Existing methods face two primary problems: (1) they rely on multi-task models, which leads to complex network architectures; and (2) they overlook the semantic relationships and spatial positional dependencies between fashion attributes. To address these issues, this paper proposes SLAR-Net, a novel hierarchical multi-label classification network that fuses spatial and semantic information to improve recognition performance. SLAR-Net adopts a progressive, hierarchical architecture. First, we introduce a lightweight backbone network enhanced with a custom-designed attention mechanism to extract low-level image features. Second, we construct an adjacency matrix that represents the relative spatial orientations of attributes; a graph convolutional network then uses this matrix to model mid-level spatial positional features. Third, we design a graph embedding matrix that captures attribute dependency relationships and use a neural network to learn high-level semantic representations from it. Finally, we propose a custom multi-head attention mechanism that fuses the spatial and semantic features, strengthening the interaction between them. Experimental results on fashion attribute datasets and benchmark datasets demonstrate that SLAR-Net outperforms state-of-the-art methods in recognition accuracy, validating the effectiveness of the proposed hierarchical architecture and fusion strategy.
Jin et al. (Mon,) studied this question.
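To make the mid-level and fusion stages described in the abstract more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a standard symmetric-normalized graph convolution over an attribute adjacency matrix and a plain scaled dot-product multi-head cross-attention without learned projections. The chain-shaped adjacency matrix, the random features, and all function names (`normalize_adjacency`, `gcn_layer`, `multi_head_fusion`) are hypothetical and chosen only for illustration; the paper's adjacency matrix encodes relative spatial orientations, and its attention mechanism is custom.

```python
import numpy as np

def normalize_adjacency(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 with self-loops."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(a_norm, h, w):
    """One graph-convolution step: ReLU(A_norm @ H @ W)."""
    return np.maximum(a_norm @ h @ w, 0.0)

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_fusion(spatial, semantic, num_heads):
    """Cross-attention in which spatial features query semantic features.

    Simplification (assumption): no learned Q/K/V projections are used.
    """
    n, d = spatial.shape
    assert d % num_heads == 0, "feature dim must be divisible by num_heads"
    dh = d // num_heads
    # Split the feature dimension into heads: (heads, n, dh).
    q = spatial.reshape(n, num_heads, dh).transpose(1, 0, 2)
    k = semantic.reshape(n, num_heads, dh).transpose(1, 0, 2)
    v = k
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)   # (heads, n, n)
    out = softmax(scores, axis=-1) @ v                # (heads, n, dh)
    # Merge heads back into a single feature vector per attribute.
    return out.transpose(1, 0, 2).reshape(n, d)

rng = np.random.default_rng(0)
num_attrs, feat_dim = 4, 8

# Hypothetical adjacency: four attributes linked in a chain.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
a_norm = normalize_adjacency(adj)

# Random stand-ins for backbone features and semantic embeddings.
node_feats = rng.normal(size=(num_attrs, feat_dim))
semantic = rng.normal(size=(num_attrs, feat_dim))

spatial = gcn_layer(a_norm, node_feats, rng.normal(size=(feat_dim, feat_dim)))
fused = multi_head_fusion(spatial, semantic, num_heads=2)

# Multi-label output: one independent sigmoid score per attribute.
logits = fused @ rng.normal(size=(feat_dim,))
probs = 1.0 / (1.0 + np.exp(-logits))
```

In this sketch each attribute is a graph node, so `fused` holds one feature vector per attribute and `probs` holds one independent probability per attribute, which is the usual output shape for multi-label classification. A trained model would replace the random matrices with learned weights.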