Retrieving pedestrian images using natural language descriptions remains challenging due to the prevalence of imperfect annotations in real-world training data. Most existing methods rely on the strong assumption of perfectly aligned image–text pairs, largely ignoring the detrimental impact of annotation noise, which typically manifests as coarse-grained descriptions and erroneous correspondences. These imperfections severely degrade model performance and generalization. To address these issues, we propose a novel framework centered on two key innovations. First, we develop a probabilistic noise identification mechanism that employs a dual-channel Gaussian mixture model (GMM) to assess alignment consistency at both global and local feature levels. Second, for samples identified as noisy, we implement a description synthesis pipeline that leverages a multimodal large language model (MLLM) to generate refined descriptions. A dynamic semantic consistency module then filters these synthesized texts to ensure quality. Comprehensive evaluations on three benchmark datasets—CUHK-PEDES, ICFG-PEDES, and RSTPReid—demonstrate the superior performance of our method: ICFG-PEDES Rank-1 = 68.13%, Rank-5 = 83.39%, Rank-10 = 89.02%; RSTPReid Rank-1 = 66.31%, Rank-5 = 86.87%, Rank-10 = 92.01%; CUHK-PEDES Rank-1 = 75.98%, Rank-5 = 90.34%, Rank-10 = 94.32%. These results show consistent top-k improvements over prior methods and validate the effectiveness of the proposed noise-aware pseudo-text augmentation.
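The noise identification step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each image–text pair has one scalar alignment loss per channel (global and local), fits a two-component one-dimensional GMM to each channel with plain EM, and treats the low-mean component as the clean one. The function names (`fit_gmm_1d`, `clean_posterior`, `flag_noisy`), the 0.5 threshold, and the rule "noisy if either channel doubts the pair" are all hypothetical choices made for illustration.

```python
import math


def fit_gmm_1d(values, n_iter=50):
    """Fit a two-component 1-D GMM with EM; returns (weights, means, variances)."""
    lo, hi = min(values), max(values)
    means = [lo, hi]                                  # init at the extremes
    variances = [max((hi - lo) ** 2 / 4.0, 1e-6)] * 2
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        resp = [_posteriors(x, weights, means, variances) for x in values]
        for k in range(2):                            # M-step
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue                              # component collapsed; keep old params
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            variances[k] = max(var, 1e-6)             # variance floor for stability
    return weights, means, variances


def _posteriors(x, weights, means, variances):
    """E-step for one value, computed in log space to avoid underflow."""
    logp = [
        math.log(max(weights[k], 1e-12))
        - 0.5 * math.log(2.0 * math.pi * variances[k])
        - (x - means[k]) ** 2 / (2.0 * variances[k])
        for k in range(2)
    ]
    m = max(logp)
    p = [math.exp(v - m) for v in logp]
    s = p[0] + p[1]
    return [p[0] / s, p[1] / s]


def clean_posterior(losses):
    """Probability that each sample belongs to the low-loss (assumed clean) component."""
    weights, means, variances = fit_gmm_1d(losses)
    clean = 0 if means[0] <= means[1] else 1
    return [_posteriors(x, weights, means, variances)[clean] for x in losses]


def flag_noisy(global_losses, local_losses, threshold=0.5):
    """Assumed fusion rule: a pair is noisy if either channel's clean posterior is low."""
    pg = clean_posterior(global_losses)
    pl = clean_posterior(local_losses)
    return [g < threshold or l < threshold for g, l in zip(pg, pl)]


if __name__ == "__main__":
    # Toy per-sample losses (made up for the example, not from the paper).
    global_losses = [0.10, 0.15, 0.20, 0.25, 0.12, 0.18, 0.20, 2.0, 2.1, 1.9, 2.2]
    local_losses = [0.30, 0.20, 0.25, 0.35, 0.28, 0.22, 1.90, 1.8, 1.7, 2.0, 1.9]
    print(flag_noisy(global_losses, local_losses))
```

In this sketch, pairs flagged as noisy would be the ones passed on to description synthesis. The seventh toy sample shows why two channels can matter: its global loss looks clean while its local loss does not.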
Yu et al. studied this question.