To address the insufficient robustness of single-modal features and the interference caused by cross-modal discrepancies in pedestrian re-identification under complex scenarios, we propose a network model that integrates joint attention mechanisms with multimodal features. Built on a residual network backbone, the model introduces a cross-modal self-attention module that adaptively weights features from the RGB, thermal infrared, and depth modalities. A multimodal feature fusion module with three branches (intra-modal enhancement, cross-modal correlation, and modal discrepancy suppression) constructs a comprehensive pedestrian feature representation. For optimization, we combine a modal cosine cross-entropy loss, a cross-modal triplet loss, a center alignment loss, and a modal consistency loss, and update the network using a min-max strategy. The proposed method achieves top-1 accuracy of 94.3% on RegDB and 88.7% on SYSU-MM01, demonstrating its effectiveness in multimodal pedestrian re-identification.
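The adaptive weighting of modality features described above can be illustrated with a minimal sketch. It is not the authors' module: it assumes each modality has already been reduced to a fixed-length feature vector, and it uses plain scaled dot-product scoring against a single query vector. The names `softmax`, `fuse_modalities`, `features`, and `query` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_modalities(features, query):
    """Weight per-modality feature vectors by their similarity to a query.

    features: dict mapping a modality name to a feature vector (list of floats)
    query:    vector of the same length (assumed here; the abstract does not
              say how the attention scores are actually computed)
    Returns (weights per modality, fused feature vector).
    """
    names = list(features)
    dim = len(query)
    # Scaled dot-product score between the query and each modality feature.
    scores = [
        sum(q * f for q, f in zip(query, features[name])) / math.sqrt(dim)
        for name in names
    ]
    weights = softmax(scores)
    # Fused representation: attention-weighted sum of the modality features.
    fused = [
        sum(w * features[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return dict(zip(names, weights)), fused


# Toy two-dimensional features for the three modalities named in the abstract.
features = {
    "rgb": [1.0, 0.0],
    "thermal": [0.0, 1.0],
    "depth": [0.5, 0.5],
}
weights, fused = fuse_modalities(features, query=[1.0, 0.0])
```

In this toy setup the modality whose feature aligns best with the query receives the largest weight, and the weights always sum to one. A learned module would replace the fixed query with trainable projections.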
Fan Li (Thu,) studied this question.