• A multi-modal detection framework for substation equipment based on YOLOv11 is proposed.
• Three novel modules (FIEI, MFSM, CFE) boost multi-modal feature fusion and enhancement.
• The method achieves 91.3% mAP@0.5, outperforming single-modal and mainstream fusion methods.
• The method is robust to adverse weather and registration errors in real inspection scenarios.
• It maintains low computational complexity and real-time performance for engineering use.

Reliable detection and localization of substation equipment under normal operating conditions is essential for the autonomous inspection of power systems. However, traditional single-modal detection methods often degrade under adverse lighting conditions or complex thermal backgrounds. This paper proposes a robust multi-modal information interaction detection framework based on the YOLOv11 architecture. To exploit complementary information from the visible and infrared modalities, three novel modules are integrated: (1) the Feature Information Extraction and Integration (FIEI) module, designed to capture fine-grained spatial and thermal features; (2) the Multi-modal Feature Shunting and Merging (MFSM) module, which adaptively resolves feature conflicts and synchronizes heterogeneous data; and (3) the Cross-modal Feature Enhancement (CFE) mechanism, which employs attention-based interaction to suppress noise in low-quality images. Experimental results on a self-built multi-modal substation dataset show that the proposed method reaches a detection accuracy of 91.3% (mAP@0.5), which is 15.56% higher than the visible-light image detection method and 18.38% higher than the infrared image detection algorithm. Compared with mainstream image fusion detection methods, detection accuracy improves by 10.87% on average. While maintaining relatively low computational complexity, the method significantly reduces missed and false detections, showing strong performance for equipment localization and detection in normal operation scenarios.
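To make the idea of cross-modal interaction more concrete, the following is a minimal sketch of one simple way two modalities can reweight each other. It is not the CFE mechanism itself, whose internals are not given here. It swaps in a squeeze-and-excitation-style channel gate as a stand-in: each modality's channels are scaled by a sigmoid gate computed from the other modality's mean activation, and the two results are summed. The function names, the list-of-channels feature layout, and the additive fusion are all illustrative assumptions.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def channel_means(feat):
    """Global average per channel; feat is a list of channels (flat float lists)."""
    return [sum(ch) / len(ch) for ch in feat]


def cross_modal_gate_fusion(vis, ir):
    """Hypothetical stand-in for cross-modal enhancement (not the paper's CFE).

    Each visible channel is scaled by a gate derived from the matching
    infrared channel, and vice versa; the gated maps are then summed.
    """
    if len(vis) != len(ir):
        raise ValueError("both modalities need the same number of channels")

    # Gates for one modality are computed from the other modality.
    vis_gates = [sigmoid(m) for m in channel_means(ir)]
    ir_gates = [sigmoid(m) for m in channel_means(vis)]

    fused = []
    for v_ch, i_ch, g_v, g_i in zip(vis, ir, vis_gates, ir_gates):
        if len(v_ch) != len(i_ch):
            raise ValueError("channels must have matching spatial size")
        # Assumed fusion rule: element-wise sum of the gated channels.
        fused.append([v * g_v + i * g_i for v, i in zip(v_ch, i_ch)])
    return fused


# Toy input: 2 channels, 2 spatial positions each; the infrared map is all zeros.
vis = [[0.0, 0.0], [1.0, 1.0]]
ir = [[0.0, 0.0], [0.0, 0.0]]
fused = cross_modal_gate_fusion(vis, ir)
```

In this toy case the all-zero infrared map yields a neutral gate of 0.5 for every visible channel. A learned attention module would replace the fixed sigmoid-of-mean gate with trainable weights.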
Wang et al. (Mon,) studied this question.