Contrastive Language-Image Pre-training (CLIP) has achieved remarkable results in person re-identification (ReID) owing to its strong cross-modal understanding and high scalability. However, the text encoder of CLIP mainly focuses on easy-to-describe attributes such as clothing, and clothing is the main interference factor that reduces recognition accuracy in cloth-changing person ReID (CC ReID). Consequently, CLIP applied directly to cloth-changing scenarios may struggle to adapt to such dynamic feature changes, which lowers identification precision. To address this challenge, we propose a CLIP-based multi-modal feature learning framework (CMFF) for CC ReID. Specifically, we first design a pose-aware identity enhancement module (PIE) to enhance the model's perception of identity-intrinsic information. In this branch, to weaken the interference of clothing information, we apply a ranking loss to minimize the difference between appearance and pose in the feature space. Second, we propose a global-local hybrid attention module (GLHA), which fuses head and global features through a cross-attention mechanism to strengthen the global recognition of key head information. Finally, considering that existing CLIP-based methods often ignore the potential importance of shallow features, we propose a graph-based multi-layer interactive enhancement module (GMIE), which groups and integrates multi-layer features of the image encoder to enhance the contextual awareness of multi-scale features. Extensive experiments on multiple popular pedestrian datasets validate the strong performance of the proposed CMFF.
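To make the cross-attention fusion idea concrete, the following is a minimal sketch and not the authors' GLHA implementation, whose details the abstract does not give. It assumes a single attention head with no learned projections, uses the global feature as the query and a handful of head-region tokens as keys and values, and adds the attended result back to the global feature as a residual. The function name `cross_attention_fuse` and the toy vectors are hypothetical.

```python
import math

def cross_attention_fuse(global_feat, head_tokens):
    """Fuse head-region tokens into a global feature (illustrative only).

    Assumption: single-head scaled dot-product attention without learned
    projections; query = global_feat, keys = values = head_tokens.
    """
    d = len(global_feat)
    # Scaled dot-product score between the global query and each head token.
    scores = [
        sum(q * k for q, k in zip(global_feat, tok)) / math.sqrt(d)
        for tok in head_tokens
    ]
    # Numerically stable softmax over the head tokens.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Attention-weighted sum of the head tokens.
    attended = [
        sum(w * tok[i] for w, tok in zip(weights, head_tokens))
        for i in range(d)
    ]
    # Residual connection: the global feature is enriched, not replaced.
    fused = [g + a for g, a in zip(global_feat, attended)]
    return fused, weights

# Toy 4-dimensional features (hypothetical values).
global_feat = [1.0, 0.0, 0.0, 0.0]
head_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
fused, weights = cross_attention_fuse(global_feat, head_tokens)
```

In this toy case the head token aligned with the global feature receives the larger attention weight. A practical module would typically add learned query, key, and value projections and multiple heads.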
Zhang et al. (Wed,) studied this question.