ABSTRACT Domain generalised person re‐identification (DG Re‐ID) is a challenging task in which models are trained on source domains but tested on unseen target domains. Although previous pure vision‐based models have achieved significant progress, their performance can still be improved. Recently, vision‐language models (VLMs) have shown outstanding generalisation capabilities in various visual applications. However, directly adapting a VLM to Re‐ID yields limited generalisation improvement, because the VLM only produces global features that are insensitive to ID nuances. To tackle this problem, we propose a CLIP‐based multi‐grained vision‐language alignment framework. Specifically, multi‐grained prompts are introduced in the language modality to describe different body parts and are aligned with their counterparts in the vision modality. To obtain fine‐grained visual information, an adaptively masked multi‐head self‐attention module is employed to extract part‐specific features. To train this module, an MLLM‐based visual grounding expert automatically generates pseudo labels of body parts for supervision. Extensive experiments on both single‐ and multi‐source generalisation protocols demonstrate the superior performance of our approach. The implementation code will be released at https://github.com/RikoLi/MUVA.
Li et al. studied this question.
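To make the idea of masked multi‐head attention over body parts more concrete, the following is a minimal sketch and not the released implementation. It assumes a fixed boolean mask marking which patch tokens belong to a body part (in the abstract the mask is adaptive and supervised by pseudo labels), uses no learned projections, and scores the pooled part feature against a prompt embedding with cosine similarity. The names `masked_multihead_attention_pool`, `cosine_similarity`, `keep_mask` and all toy values are hypothetical.

```python
import math


def masked_multihead_attention_pool(query, tokens, keep_mask, num_heads):
    """Pool patch tokens into one part feature with masked multi-head attention.

    Assumptions (not from the paper): identity projections for query, key and
    value, and a fixed boolean mask instead of a learned, adaptive one.
    """
    dim = len(query)
    if dim % num_heads != 0:
        raise ValueError("feature dimension must be divisible by num_heads")
    if not any(keep_mask):
        raise ValueError("mask must keep at least one token")
    head_dim = dim // num_heads
    pooled = []
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        q = query[lo:hi]
        # Scaled dot-product scores; tokens outside the part get -inf.
        scores = []
        for tok, keep in zip(tokens, keep_mask):
            if keep:
                dot = sum(a * b for a, b in zip(q, tok[lo:hi]))
                scores.append(dot / math.sqrt(head_dim))
            else:
                scores.append(-math.inf)
        # Numerically stable softmax; masked tokens receive weight 0.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value slice for this head.
        for j in range(lo, hi):
            pooled.append(sum(w * tok[j] for w, tok in zip(weights, tokens)))
    return pooled


def cosine_similarity(a, b):
    """Cosine similarity between a part feature and a prompt embedding."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy example: three patch tokens, the third lies outside the body part.
tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [5.0, 5.0, 5.0, 5.0]]
keep_mask = [True, True, False]
part_query = [0.0, 0.0, 0.0, 0.0]
part_feature = masked_multihead_attention_pool(part_query, tokens, keep_mask, 2)
prompt_embedding = [1.0, 1.0, 0.0, 0.0]  # stand-in for a text feature
alignment = cosine_similarity(part_feature, prompt_embedding)
```

Because masked tokens receive zero attention weight, changing the third token leaves `part_feature` unchanged, which is the property that lets a part‐level feature ignore regions outside the described body part.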