Unsupervised visible-infrared person re-identification (USVI-ReID) is a challenging task that aims to retrieve pedestrian images across modalities without using any label information. In this task, the large cross-modality variance makes it difficult to generate reliable cross-modality labels, and the lack of annotations makes it harder still to learn modality-invariant features. To facilitate this unsupervised cross-modal learning, we begin by leveraging the information contained in the cross-modality input and its predicted label. To minimize information loss, we optimize the model with three objectives: entropy minimization, a uniform label distribution, and cross-modality matching. Specifically, we design an iterative training loop that alternates between model training and cross-modality matching, in which a uniform-prior-guided optimal transport assignment selects matched visible and infrared prototypes. This matching information is then used to minimize the intra- and cross-modality entropy. As a result, our model gradually self-learns useful information, enabling it to generate discriminative representations for unlabeled cross-modal data. Extensive experiments on benchmarks demonstrate the effectiveness of our method, e.g., Rank-1 accuracy of 69.4% and 89.4% on SYSU-MM01 and RegDB, respectively, without any annotations. The code will be released soon.
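To make the matching step concrete, the following is a minimal sketch of an optimal transport assignment with uniform marginals between visible and infrared prototypes. It is an illustration under stated assumptions, not the implementation described above: the Sinkhorn-Knopp solver, the cosine-distance cost, the regularization weight `eps`, and the function names `sinkhorn_uniform`, `cosine_cost`, and `match_prototypes` are all hypothetical choices made for this example.

```python
import math


def cosine_cost(a, b):
    """Cosine distance between two prototype vectors (assumed cost function)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def sinkhorn_uniform(cost, eps=0.1, n_iters=200):
    """Entropy-regularized transport plan whose row and column marginals
    are both uniform (the 'uniform prior' in this sketch)."""
    n, m = len(cost), len(cost[0])
    row_mass, col_mass = 1.0 / n, 1.0 / m
    # Gibbs kernel: a low cost gives a high affinity.
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to hit the target marginals.
        u = [row_mass / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [col_mass / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def match_prototypes(visible, infrared, eps=0.1):
    """Match each visible prototype to the infrared prototype that
    receives the most transported mass."""
    cost = [[cosine_cost(p, q) for q in infrared] for p in visible]
    plan = sinkhorn_uniform(cost, eps)
    matches = [max(range(len(infrared)), key=lambda j: plan[i][j])
               for i in range(len(visible))]
    return matches, plan


# Toy prototypes (made-up values, for illustration only).
visible = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
infrared = [[0.1, 1.0], [1.0, 0.9], [1.0, 0.05]]
matches, plan = match_prototypes(visible, infrared)
print(matches)
```

The uniform marginals keep the solver from assigning most visible prototypes to a few infrared ones, which is one plausible reading of how a uniform prior guides the assignment. In an alternating scheme, a matching like this would be recomputed after each round of model training.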
Zhang et al. (Thu,) studied this question.