What does this research mean for the field?

The proposed multi-grained vision–language alignment framework improves domain generalised person re-identification (DG Re-ID) by extracting fine-grained, part-level visual features rather than relying only on global features that are insensitive to identity nuances.

What question did this study set out to answer?

The study asks how a vision–language model can be adapted to improve domain generalised person re-identification, where models are trained on source domains but tested on unseen target domains.

March 10, 2026 · Open Access

Multi‐Grained Vision–Language Alignment for Domain Generalised Person Re‐Identification

Key Points

Aims to improve domain generalised person re-identification (DG Re-ID) with a vision–language approach.
Proposes a CLIP-based multi-grained vision–language alignment framework.
Introduces multi-grained language prompts that describe different body parts and align with their visual counterparts.
Employs an adaptively masked multi-head self-attention module to extract part-specific visual features.
Uses an MLLM-based visual grounding expert to generate body-part pseudo labels for supervision.
Improves person re-identification performance on unseen target domains.
Experiments on both single- and multi-source generalisation protocols confirm the benefit of the approach.
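
The prompt-and-alignment idea in the key points can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the part names, the prompt template, the `build_prompts` and `alignment_loss` helpers, the temperature value, and the CLIP-style cross-entropy objective are all hypothetical choices, and the short feature vectors stand in for real CLIP encoder outputs.

```python
import math

# Hypothetical part names; the paper's actual prompt wording is not given here.
PARTS = ["whole body", "head", "upper body", "lower body"]


def build_prompts(parts, template="a photo of the {} of a person"):
    """Build one text prompt per body part (the template is an assumption)."""
    return [template.format(part) for part in parts]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_loss(visual_feats, text_feats, temperature=0.07):
    """Assumed CLIP-style objective: the visual feature of part i should be
    most similar to the text feature of prompt i among all part prompts."""
    losses = []
    for i, visual in enumerate(visual_feats):
        logits = [cosine(visual, text) / temperature for text in text_feats]
        top = max(logits)  # subtract the max for numerical stability
        log_z = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_z - logits[i])  # cross-entropy with target index i
    return sum(losses) / len(losses)


prompts = build_prompts(PARTS)
```

In this sketch the loss is close to zero when each part feature points in the same direction as its own prompt embedding, and it grows when part features match the wrong prompt.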

Abstract

Domain generalised person re‐identification (DG Re‐ID) is a challenging task, where models are trained on source domains but tested on unseen target domains. Although previous pure vision‐based models have achieved significant progress, the performance can be further improved. Recently, vision–language models (VLMs) have shown outstanding generalisation capabilities in various visual applications. However, directly adapting a VLM to Re‐ID yields limited generalisation improvement, because the VLM only produces global features that are insensitive to ID nuances. To tackle this problem, we propose a CLIP‐based multi‐grained vision–language alignment framework. Specifically, several multi‐grained prompts are introduced in the language modality to describe different body parts and align with their counterparts in the vision modality. To obtain fine‐grained visual information, an adaptively masked multi‐head self‐attention module is employed to precisely extract specific part features. To train the proposed module, an MLLM‐based visual grounding expert is employed to automatically generate pseudo labels of body parts for supervision. Extensive experiments conducted on both single‐ and multi‐source generalisation protocols demonstrate the superior performance of our approach. The implementation code will be released at https://github.com/RikoLi/MUVA.
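
The masking idea behind the part-feature extraction can be illustrated with a small, dependency-free Python sketch. It is a simplified stand-in rather than the module described in the abstract: the function names are hypothetical, the learned query/key/value projections are omitted (treated as identity), and the part mask is passed in by hand, whereas the abstract describes a mask that is produced adaptively and supervised by pseudo labels.

```python
import math


def masked_softmax(scores, key_mask):
    """Softmax over kept positions only; masked-out positions get weight 0."""
    kept = [s for s, keep in zip(scores, key_mask) if keep]
    if not kept:
        raise ValueError("key_mask must keep at least one token")
    top = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - top) if keep else 0.0
            for s, keep in zip(scores, key_mask)]
    total = sum(exps)
    return [e / total for e in exps]


def masked_self_attention(tokens, key_mask, num_heads):
    """Multi-head self-attention in which every query attends only to the
    tokens whose key_mask entry is True (for example, patches of one part).

    Assumption: Q, K and V are the raw token slices for each head.
    """
    dim = len(tokens[0])
    if dim % num_heads != 0:
        raise ValueError("feature dimension must be divisible by num_heads")
    head_dim = dim // num_heads
    outputs = [[] for _ in tokens]
    for head in range(num_heads):
        lo, hi = head * head_dim, (head + 1) * head_dim
        for i, query_token in enumerate(tokens):
            query = query_token[lo:hi]
            scores = [
                sum(a * b for a, b in zip(query, tok[lo:hi])) / math.sqrt(head_dim)
                for tok in tokens
            ]
            weights = masked_softmax(scores, key_mask)
            # Append this head's slice; heads are concatenated in order.
            outputs[i].extend(
                sum(w * tok[d] for w, tok in zip(weights, tokens))
                for d in range(lo, hi)
            )
    return outputs
```

With a mask that keeps only the patches of one body part, every output token becomes a weighted average of that part's patches alone, so background or other-part patches cannot leak into the part feature.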
