What question did this study set out to answer?

This study asks how text descriptions can be matched with person images in both visible and infrared modalities within a single framework, using the color and texture attributes in the text to guide alignment.

May 15, 2026

Text-Visible/Infrared Person Retrieval: Attribute-Guided Feature Decoupling and Collaborative Alignment and A Unified Benchmark

Key Points

Introduces the Text-Visible/Infrared person retrieval task: matching a text description against person images in both visible and infrared modalities within one framework.
Proposes ACANet, an Attribute-guided feature decoupling and Collaborative Alignment Network.
Contributes the MM01LLCM-Text dataset, which pairs visible and infrared person images with fine-grained text descriptions.
Extends conventional masked language modeling to a cross-modal paradigm for fine-grained alignment across image modalities.
Reports that ACANet outperforms existing state-of-the-art methods on the MM01LLCM-Text dataset.
Reports improved retrieval accuracy for both visible and infrared images.
Demonstrates the effectiveness of the attribute-guided approach for collaborative alignment.

Abstract

Research on text-to-image person retrieval primarily focuses on visible images, which are not suitable in low-light scenarios. Infrared imaging becomes necessary in many visual systems, and matching text with both visible and infrared images is required. However, visible and infrared images are heterogeneous, with different visual characteristics, so matching text with them in a unified framework is very challenging. In this work, we design a new task called Text-Visible/Infrared person retrieval and contribute a novel approach and a unified benchmark to promote the research and development of this field. First, we propose a novel Attribute-guided feature decoupling and Collaborative Alignment Network (ACANet) that pursues accurate alignment from the text modality to both visible and infrared modalities in a unified framework according to the texture and color attribute information of text descriptions. In particular, we decouple the color features of visible images, supervised by the text labels, and integrate them into the infrared features to eliminate the impact of the absence of color information in infrared images during cross-modal collaborative alignment. Moreover, we decouple the texture information from visible images, again supervised by the text labels, and perform the collaborative alignment of texture and infrared features with a fusion agent. In addition, we extend conventional masked language modeling to a cross-modal paradigm to help ACANet learn uniform fine-grained alignment in multiple image modalities. Second, we contribute a unified high-quality MM01LLCM-Text dataset, which provides person images in both visible and infrared modalities paired with fine-grained text descriptions. Experimental results show that the proposed ACANet outperforms existing state-of-the-art methods on the MM01LLCM-Text dataset.
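
To make the masked language modeling step more concrete, here is a minimal sketch of how attribute words in a caption could be masked so that a model must recover them from image features. This is an illustration only, not the authors' implementation: the abstract does not say which tokens are masked or how, so the color lexicon `COLOR_WORDS`, the function `mask_attribute_tokens`, and the choice to mask every color word are all assumptions made for this example.

```python
# Hypothetical color lexicon (an assumption for this sketch); the abstract
# does not specify the attribute vocabulary used by ACANet.
COLOR_WORDS = {"black", "white", "red", "blue", "green",
               "yellow", "gray", "grey", "brown", "pink"}
MASK = "[MASK]"


def mask_attribute_tokens(caption, lexicon=COLOR_WORDS, mask_token=MASK):
    """Replace every whitespace-separated token whose word is in `lexicon`.

    Returns (masked_tokens, targets), where `targets` maps each masked
    position to the original word a model would be trained to predict.
    """
    tokens = caption.lower().split()
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        word = tok.strip(".,;:!?")  # ignore surrounding punctuation
        if word in lexicon:
            targets[i] = word
            # Keep any attached punctuation, swap only the word itself.
            masked.append(tok.replace(word, mask_token))
        else:
            masked.append(tok)
    return masked, targets


masked, targets = mask_attribute_tokens("A man in a black jacket and blue jeans.")
print(" ".join(masked))
print(targets)
```

In a cross-modal setting, the masked caption would be paired with a visible or infrared image, and the prediction targets in `targets` would supervise the model. How ACANet actually conditions these predictions on each image modality is not described in the abstract.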
