What type of study is this?

This is a quantitative study.

October 5, 2025 | Open Access

ViTC-UReID: Enhancing unsupervised person ReID with vision transformer image encoder and camera-aware proxy learning

Key Points

ViTC-UReID enhances person re-identification by improving feature consistency across camera views.
The model reduces domain shift effects while maintaining strong discrimination, demonstrating effective cross-camera adaptation.
Clustering algorithms generate pseudo labels for training samples, using features from a vision transformer encoder built on self-attention.
Evaluated on benchmarks like Market-1501 and MSMT17, ViTC-UReID outperforms existing unsupervised methods and approaches supervised accuracy.
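
The pseudo-labeling step mentioned above can be illustrated with a minimal sketch. The summary does not name the clustering algorithm, so the code below uses a simplified stand-in: samples whose cosine distance falls below a threshold are linked, and each connected group becomes one pseudo identity. The function name `pseudo_labels`, the parameter `eps`, and the toy feature vectors are assumptions made for illustration, not part of ViTC-UReID.

```python
import numpy as np


def pseudo_labels(features, eps=0.3):
    """Link samples with cosine distance < eps; each linked group gets one ID.

    Simplified stand-in for a density-based clusterer (an assumption; the
    source does not specify the algorithm). Unlinked samples get label -1.
    """
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    dist = 1.0 - feats @ feats.T  # pairwise cosine distance
    n = len(feats)
    parent = list(range(n))  # union-find forest over sample indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i, j] < eps:
                parent[find(i)] = find(j)

    roots = np.array([find(i) for i in range(n)])
    labels = np.full(n, -1)
    next_id = 0
    for root in np.unique(roots):
        members = np.where(roots == root)[0]
        if len(members) > 1:  # singletons stay unlabeled outliers
            labels[members] = next_id
            next_id += 1
    return labels


# Toy 2-D "embeddings": two tight pairs and one isolated sample.
toy = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99], [-1.0, -1.0]])
print(pseudo_labels(toy))  # [ 0  0  1  1 -1]
```

In a real pipeline the rows of `features` would be image embeddings from the encoder, and the labels would be regenerated periodically as the encoder improves.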

Abstract

Person re-identification (ReID) plays a crucial role in computer vision-based surveillance systems, enabling the accurate identification of individuals across multiple camera views. Traditional convolutional neural network (CNN)-based approaches, such as those built on ResNet-50, struggle to capture long-range dependencies and contextual relationships, which limits their effectiveness in diverse real-world scenarios. To overcome these challenges, recent work has explored Vision Transformer (ViT)-based architectures, which use self-attention mechanisms to obtain richer feature representations. In this research, we introduce ViTC-UReID, a ViT-based framework for unsupervised person ReID that incorporates a camera-aware proxy learning mechanism to improve feature consistency across camera viewpoints. ViTC-UReID also uses clustering algorithms to generate pseudo labels for the samples in the training datasets. Our approach significantly enhances cross-camera adaptation, reducing domain shift effects while maintaining strong feature discrimination. We evaluate our method on three widely used benchmarks, Market-1501, MSMT17, and CUHK03, where it outperforms existing state-of-the-art unsupervised methods, particularly those that use camera identity cues. Furthermore, our model achieves accuracy competitive with fully supervised methods, highlighting the effectiveness of transformer-based representations in complex person ReID scenarios. Our findings reinforce the growing potential of unsupervised person ReID methods and demonstrate that ViT architectures combined with camera-aware learning can drive substantial improvements in person ReID.
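
The abstract does not spell out how camera-aware proxies are formed, so the following is only a rough sketch of one common reading of the idea: each pseudo-label cluster is split by camera ID, the centroid of each (cluster, camera) group serves as a proxy, and a query feature is scored against all proxies with a softmax cross-entropy. The function names `camera_aware_proxies` and `proxy_loss`, the temperature value, and the toy data are all hypothetical and should not be read as the authors' implementation.

```python
import numpy as np


def camera_aware_proxies(features, labels, cam_ids):
    """Return {(pseudo_label, camera_id): unit-norm centroid}.

    Assumption: one proxy per (cluster, camera) pair; samples labeled -1
    (clustering outliers) are ignored.
    """
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    proxies = {}
    for label, cam in sorted(set(zip(labels.tolist(), cam_ids.tolist()))):
        if label < 0:
            continue
        mask = (labels == label) & (cam_ids == cam)
        centroid = feats[mask].mean(axis=0)
        proxies[(label, cam)] = centroid / np.linalg.norm(centroid)
    return proxies


def proxy_loss(query, target_key, proxies, temperature=0.07):
    """Softmax cross-entropy of a query against all proxies (illustrative)."""
    keys = list(proxies)
    bank = np.stack([proxies[k] for k in keys])
    q = query / np.linalg.norm(query)
    logits = bank @ q / temperature
    logits -= logits.max()  # subtract max for numerical stability
    log_prob = logits - np.log(np.exp(logits).sum())
    return -log_prob[keys.index(target_key)]


# Toy data: cluster 0 is seen by cameras 0 and 1, cluster 1 by camera 0 only.
feats = np.array([[1.0, 0.0], [1.0, 0.2], [0.8, 0.6], [0.0, 1.0], [0.0, 1.0]])
labels = np.array([0, 0, 0, 1, -1])
cams = np.array([0, 0, 1, 0, 1])

proxies = camera_aware_proxies(feats, labels, cams)
print(sorted(proxies))  # [(0, 0), (0, 1), (1, 0)]
print(proxy_loss(np.array([1.0, 0.1]), (0, 0), proxies))
```

Keeping a separate proxy per camera lets a loss treat same-identity, different-camera proxies explicitly, which is one way to encourage the cross-camera feature consistency the abstract describes.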
