Robust multispectral pedestrian detection remains challenging in complex environments characterized by low illumination, strong thermal contrast, and background clutter. Although RGB–thermal fusion provides complementary cues, lightweight detectors often suffer from unstable feature representations across scales and insufficient control over modality-biased responses during fusion. Both problems can degrade localization accuracy and weaken the detection of small or distant pedestrians. To address these issues, we develop a lightweight stage-wise RGB–thermal fusion pipeline that integrates pre-fusion feature refinement, cross-modal interaction, and post-fusion adaptive recalibration. Specifically, a Multi-scale Feature Refinement (MSFR) module is placed at the mid-level to enhance modality-specific representations by jointly modeling local details and contextual information, thereby reducing scale-sensitive noise before interaction. An established Cross-Modality Fusion Transformer (CFT) is then adopted to promote semantic correspondence between RGB and thermal features. After interaction, an Adaptive Feature Recalibration (AFR) module suppresses background-dominated and modality-biased responses through lightweight channel-wise adjustment. Extensive experiments on three public RGB–thermal benchmarks, namely the pedestrian-focused KAIST and LLVIP datasets and the FLIR-aligned road-scene benchmark, demonstrate that the proposed method achieves a favorable accuracy–efficiency trade-off, delivering consistent improvements over the lightweight baseline while maintaining a compact architecture and real-time inference capability.
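To make the idea of "lightweight channel-wise adjustment" concrete, the following is a minimal sketch of one common way to recalibrate channels after fusion: a squeeze-and-excitation-style gate. The abstract does not specify the internals of AFR, so this is an assumption for illustration only, not the authors' module. All names (`global_avg_pool`, `channel_recalibrate`, `init_gate_params`), the bottleneck reduction ratio, and the random weights are hypothetical, and the code uses plain Python lists rather than a deep learning framework.

```python
import math
import random


def global_avg_pool(feature_map):
    """Squeeze step: mean of each channel over its H x W spatial grid."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_map]


def dense(vec, weights, bias):
    """Fully connected layer; each row of `weights` yields one output."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_recalibrate(feature_map, w1, b1, w2, b2):
    """Scale every channel by a gate in (0, 1) derived from global context.

    Assumed SE-style design (not taken from the paper):
    pool -> bottleneck + ReLU -> expand + sigmoid -> per-channel scaling.
    """
    squeezed = global_avg_pool(feature_map)
    hidden = [max(0.0, h) for h in dense(squeezed, w1, b1)]  # ReLU
    gates = [sigmoid(g) for g in dense(hidden, w2, b2)]
    scaled = [[[g * x for x in row] for row in ch]
              for g, ch in zip(gates, feature_map)]
    return scaled, gates


def init_gate_params(channels, reduction=2, seed=0):
    """Hypothetical random weights; a trained model would learn these."""
    rng = random.Random(seed)
    hidden = max(1, channels // reduction)
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(channels)]
          for _ in range(hidden)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)]
          for _ in range(channels)]
    return w1, [0.0] * hidden, w2, [0.0] * channels


# Toy fused feature map: 4 channels, each a 2 x 2 spatial grid.
fused = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 0.0]],
    [[-1.0, 1.0], [2.0, -2.0]],
    [[5.0, 5.0], [5.0, 5.0]],
]
params = init_gate_params(channels=len(fused))
recalibrated, gates = channel_recalibrate(fused, *params)
print("per-channel gates:", gates)
```

In a gate of this kind, channels whose learned gate is close to 0 are attenuated and channels with a gate close to 1 pass through almost unchanged. The added cost is two small dense layers per feature map, which is why this family of designs is often described as lightweight.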
Song et al. studied this question.