Person re-identification (ReID) plays a crucial role in vision-based surveillance systems by matching individuals across multiple camera views. Traditional approaches based on convolutional neural networks (CNNs), such as those built on ResNet-50, struggle to capture long-range dependencies and contextual relationships, which limits their effectiveness in diverse real-world scenarios. To overcome these limitations, recent work has explored Vision Transformer (ViT)-based architectures, which use self-attention to obtain richer feature representations. In this research, we introduce ViTC-UReID, a ViT-based framework for unsupervised person ReID that incorporates a camera-aware proxy learning mechanism to improve feature consistency across camera viewpoints. ViTC-UReID also uses clustering algorithms to generate pseudo labels for the samples in the training datasets. Our approach substantially improves cross-camera adaptation, reducing the effects of domain shift while maintaining strong feature discrimination. We evaluate our method on three widely used benchmarks, Market-1501, MSMT17, and CUHK03, where it outperforms existing state-of-the-art unsupervised methods, particularly those that use camera identity cues. Furthermore, our model achieves accuracy competitive with fully supervised methods, highlighting the effectiveness of transformer-based representations in complex person ReID scenarios. Our findings reinforce the growing potential of unsupervised person ReID and show that ViT architectures combined with camera-aware learning can drive substantial improvements in the task.
Pham et al. (Tue,) studied this question.
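
To make the camera-aware proxy idea in the abstract more concrete, the sketch below shows one plausible way to build such proxies: each cluster of pseudo-labeled samples is split by camera, and the normalized features of every (cluster, camera) pair are averaged into one proxy. This is a minimal illustration under stated assumptions, not the ViTC-UReID implementation. The function names, the use of -1 to mark clustering outliers, and the toy data are all hypothetical; in practice the features would come from the ViT backbone and the pseudo labels from the clustering step.

```python
import math
from collections import defaultdict


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def camera_aware_proxies(features, pseudo_labels, cam_ids):
    """Average the normalized features of each (cluster, camera) pair.

    Samples with pseudo label -1 are treated as clustering outliers and
    skipped (an assumed convention, borrowed from DBSCAN-style output).
    """
    groups = defaultdict(list)
    for feat, label, cam in zip(features, pseudo_labels, cam_ids):
        if label == -1:
            continue
        groups[(label, cam)].append(l2_normalize(feat))

    proxies = {}
    for key, members in groups.items():
        dim = len(members[0])
        mean = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        proxies[key] = l2_normalize(mean)
    return proxies


# Toy data (hypothetical): five samples, two pseudo identities,
# two cameras, and one outlier that the clustering step rejected.
features = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [0.0, 1.0], [5.0, 5.0]]
pseudo_labels = [0, 0, 0, 1, -1]
cam_ids = [0, 0, 1, 1, 0]

proxies = camera_aware_proxies(features, pseudo_labels, cam_ids)
```

Keying proxies by both pseudo identity and camera means that the same identity seen from two cameras yields two proxies, so a training loss could pull them together across cameras while keeping different identities apart within a camera.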