Pedestrian detection under low illumination and in complex environments remains a significant challenge for vision-based systems, particularly in safety-critical applications such as urban rail transit. To address the limitations of single-modality detection in adverse conditions, this paper proposes IVIFusion, a lightweight yet robust pedestrian detection framework that fuses infrared and visible images at the feature level. The method uses a dual-branch Transformer-based backbone for modality-specific feature extraction and introduces a Cross-Modality Attention Fusion Module (CMAFM) that adaptively enhances cross-modal representations while suppressing noise. A dedicated small-object detection layer is added to improve the recall of distant and occluded pedestrians. Experiments on the public LLVIP dataset and the custom HGPD dataset show that IVIFusion achieves mAP@0.5 scores of 98.6% and 97.2%, respectively. These results support the effectiveness of the proposed architecture under challenging lighting conditions while maintaining real-time efficiency and low computational cost.
Yang et al. studied this question.
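
The mAP at an IoU threshold of 0.5 reported in the abstract rests on a simple matching rule: a predicted box counts as a correct detection only if its intersection over union (IoU) with a ground-truth box is at least 0.5. The following is a minimal sketch of that criterion, not the authors' evaluation code. It assumes boxes in `(x1, y1, x2, y2)` corner format, and the function names `iou` and `is_true_positive` are illustrative rather than taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) boxes.
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.5):
    """Hypothetical helper: does the prediction match at the given IoU threshold?"""
    return iou(pred_box, gt_box) >= threshold
```

A full mAP computation additionally ranks detections by confidence, matches each ground-truth box at most once, and integrates the precision-recall curve; the sketch covers only the per-box overlap test.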