Research on text-to-image person retrieval primarily focuses on visible images, which are unsuitable in low-light scenarios. Infrared imaging therefore becomes necessary in many visual systems, and text must be matched with both visible and infrared images. However, visible and infrared images are heterogeneous, with different visual characteristics, so matching text with both in a unified framework is very challenging. In this work, we introduce a new task called Text-Visible/Infrared person retrieval and contribute a novel approach and a unified benchmark to promote research and development in this field. On one hand, we propose a novel Attribute-guided feature decoupling and Collaborative Alignment Network (ACANet) that pursues accurate alignment from the text modality to both the visible and infrared modalities in a unified framework, guided by the texture and color attribute information in the text descriptions. In particular, we decouple the color features of visible images under the supervision of the text labels and integrate them into the infrared features, which eliminates the impact of the missing color information in infrared images during cross-modal collaborative alignment. Moreover, we decouple the texture information from visible images under the supervision of the text labels and perform collaborative alignment of the texture and infrared features with a fusion agent. In addition, we extend conventional masked language modeling to a cross-modal paradigm to help ACANet learn uniform fine-grained alignment across multiple image modalities. On the other hand, we contribute a unified, high-quality MM01LLCM-Text dataset, which provides person images in both visible and infrared modalities paired with fine-grained text descriptions. Experimental results show that the proposed ACANet outperforms existing state-of-the-art methods on the MM01LLCM-Text dataset.
Li et al. studied this question.