What question did this study set out to answer?

This work aims to classify clothing by occasion using advanced computer vision techniques and compare various models' effectiveness.

April 19, 2026 | Open Access

Occasion-Based Clothing Classification Using Vision Transformer and Traditional Machine Learning Models

Key points

Created a balanced dataset of 15,000 clothing images from Fashionpedia.
Used automated classification and manual labeling for image annotations.
Implemented preprocessing steps like resizing, normalization, and background removal.
Tested traditional models (ANNs, SVMs, KNNs) and deep learning models (CNNs, VGG16, ViT) using the same data processing.
Evaluated each model by accuracy under both labeling methods.
Traditional models achieved moderate accuracy between 54% and 66%.
The ViT model showed an accuracy of 81.78% with automated classification.
ViT with manual labeling reached 98.09% accuracy.
Higher labeling accuracy combined with preprocessing steps improved model performance significantly.
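
The balancing and accuracy-based evaluation mentioned in the points above can be illustrated with a minimal sketch. It assumes random undersampling to the smallest class, which the summary does not confirm as the method used; the function names `balance_by_undersampling` and `accuracy` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def balance_by_undersampling(samples, seed=0):
    """Return a class-balanced copy of `samples`, a list of (item, label) pairs.

    Assumption: balancing is done by randomly undersampling every class
    down to the size of the smallest one. The source only says the
    subset is balanced, not how.
    """
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append(item)

    target = min(len(items) for items in by_label.values())
    rng = random.Random(seed)  # fixed seed so every model sees the same subset

    balanced = []
    for label in sorted(by_label):
        for item in rng.sample(by_label[label], target):
            balanced.append((item, label))
    rng.shuffle(balanced)
    return balanced


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty label list")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


if __name__ == "__main__":
    # Toy data: file names and occasion labels are placeholders.
    toy = (
        [(f"casual_{i}.jpg", "casual") for i in range(5)]
        + [(f"formal_{i}.jpg", "formal") for i in range(3)]
        + [(f"party_{i}.jpg", "party") for i in range(2)]
    )
    subset = balance_by_undersampling(toy)
    print(len(subset))
    print(accuracy(["formal", "party", "casual", "casual"],
                   ["formal", "casual", "casual", "casual"]))
```

Fixing the random seed is one simple way to keep the subset identical across models, in line with the point that all models used the same data processing.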

Abstract

Clothing classification by occasion is an important area in computer vision and artificial intelligence (AI). The task is challenging because clothing categories such as formal, party, and casual attire are visually similar in subtle ways, and variations in color, fabric, patterns, and lighting add further complexity. To address this challenge, we used the Fashionpedia dataset to create a balanced subset of 15,000 images. We labeled these images with two methods: automated classification, which relies on category identifiers (IDs) and components, and manual labeling performed by human annotators. Our preprocessing pipeline consists of resizing, image normalization, background removal using segmentation masks, and class balancing. We benchmarked traditional models trained on histogram of oriented gradients (HOG) features, namely artificial neural networks (ANNs), support vector machines (SVMs), and k-nearest neighbors (KNNs), against deep learning models: convolutional neural networks (CNNs), the Visual Geometry Group 16 (VGG16) model with transfer learning, and the vision transformer (ViT) model. All models were evaluated using identical data splits and preprocessing procedures. The traditional models achieved moderate accuracy, ranging from 54% to 66%. In contrast, the ViT model achieved an accuracy of 81.78% with automated classification and 98.09% with manual labeling. This indicates that higher label accuracy, together with the preprocessing steps used, significantly improves performance. Together, these factors improve the effectiveness of ViT in context-aware apparel classification and establish a reliable baseline for future research.
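
The image-level preprocessing steps named in the abstract (mask-based background removal, resizing, and normalization) might look roughly like the NumPy sketch below. It is an illustration under stated assumptions, not the authors' pipeline: the 224-pixel target size, the nearest-neighbour resize, the per-channel mean and standard deviation of 0.5, and all function names are hypothetical choices that the abstract does not specify.

```python
import numpy as np


def remove_background(image, mask):
    """Zero out every pixel outside the garment.

    image: (H, W, 3) uint8 array; mask: (H, W) boolean array that is
    True on garment pixels (for example, a segmentation mask).
    """
    return image * mask[..., None].astype(image.dtype)


def resize_nearest(image, size):
    """Nearest-neighbour resize to (size, size), NumPy only.

    Assumption: a real pipeline would more likely use an image library
    with proper interpolation; this keeps the sketch dependency-light.
    """
    h, w = image.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return image[rows[:, None], cols]


def normalize(image, mean, std):
    """Scale 8-bit values to [0, 1], then standardize each channel."""
    scaled = image.astype(np.float32) / 255.0
    return (scaled - mean) / std


def preprocess(image, mask, size=224,
               mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)):
    """Background removal, then resize, then normalization.

    The default size, mean, and std are assumed values, not ones
    reported in the abstract.
    """
    foreground = remove_background(image, mask)
    resized = resize_nearest(foreground, size)
    return normalize(resized,
                     np.asarray(mean, dtype=np.float32),
                     np.asarray(std, dtype=np.float32))


if __name__ == "__main__":
    # Toy input: a white image whose left half is marked as garment.
    image = np.full((4, 6, 3), 255, dtype=np.uint8)
    mask = np.zeros((4, 6), dtype=bool)
    mask[:, :3] = True
    out = preprocess(image, mask, size=2)
    print(out.shape, out.dtype)
```

Applying the mask before resizing avoids having to resize the mask separately; the order of steps here is a design choice for the sketch rather than something the abstract states.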
