What question did this study set out to answer?

The study aims to improve RGB-infrared object detection for UAV applications by addressing weaknesses in frequency-domain feature use, cross-modal feature fusion, and fine-grained object localization.

March 1, 2026 · Open Access

MFE-DETR: Multimodal Feature-Enhanced Detection Transformer for RGB–Infrared Object Detection in Aerial Imagery

Key points

MFE-DETR is a detection transformer that fuses RGB and infrared imagery for object detection.
The Dual-Modality Enhancement Module (DMEM) combines an HWD-Stream for Haar wavelet frequency analysis with an AKR-Stream for adaptive nonlinear feature refinement.
The Cross-scale Channel Feature Fusion module is extended with an Adaptive Feature Fusion Module (AFAM) whose gates adjust each modality's contribution according to spatial informativeness.
The Bilinear Attention-Enhanced Detection Module (BADM) models second-order feature interactions through factorized bilinear pooling.
On the DroneVehicle benchmark, MFE-DETR reaches 78.6% mAP50 and 57.8% mAP50:95, outperforming state-of-the-art approaches by 5.3% and 3.7%, respectively.
On the VisDrone dataset, the method generalizes well, reaching 18.6% APS on small objects, a 1.5% improvement over existing techniques.
Ablation studies and visualizations show how each proposed component contributes.
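
The HWD-Stream point above refers to Haar wavelet decomposition, which splits a feature map into one low-frequency band and three high-frequency detail bands. The following is a minimal NumPy sketch of a single-level 2D Haar transform and its inverse, not the authors' implementation. The function names and the fixed `low_gain` and `high_gain` scalars are illustrative assumptions; the paper enhances the bands with learned operations rather than constant gains.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar transform of an (H, W) array with even H and W.

    Returns (ll, lh, hl, hh): one low-frequency approximation band and
    three high-frequency detail bands, each of shape (H/2, W/2).
    Naming conventions for the lh/hl bands vary between libraries.
    """
    x = np.asarray(x, dtype=np.float64)
    h, w = x.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # block average (scaled): coarse structure
    lh = (a - b + c - d) / 2.0  # differences between adjacent columns
    hl = (a + b - c - d) / 2.0  # differences between adjacent rows
    hh = (a - b - c + d) / 2.0  # diagonal differences
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuild the (H, W) array from the four bands."""
    h, w = ll.shape
    out = np.empty((2 * h, 2 * w), dtype=np.float64)
    out[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    out[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    out[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    out[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return out

def enhance_bands(x, low_gain, high_gain):
    """Hypothetical enhancement: scale low and high bands independently.

    Constant gains stand in for the learned per-band processing that the
    paper describes; they are an assumption made for illustration only.
    """
    ll, lh, hl, hh = haar_dwt2(x)
    return haar_idwt2(low_gain * ll, high_gain * lh,
                      high_gain * hl, high_gain * hh)
```

With both gains set to 1 the round trip reproduces the input exactly, and with `high_gain=0` every 2x2 block collapses to its mean, which makes the split between structure and texture easy to inspect.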

Abstract

Multimodal object detection utilizing RGB and infrared (IR) imagery has become a critical research area for unmanned aerial vehicle (UAV) surveillance applications, providing reliable perception under various lighting and environmental conditions. Nevertheless, current methods encounter three primary challenges: (1) insufficient utilization of frequency-domain properties in heterogeneous modalities, (2) restricted adaptability in crossmodal feature integration across different environmental scenarios, and (3) inadequate modeling of fine-grained spatial relationships for accurate object localization. To overcome these limitations, we introduce MFE-DETR, a novel Multimodal Feature-Enhanced Detection Transformer that achieves superior RGB-IR fusion through three complementary innovations. First, we present the Dual-Modality Enhancement Module (DMEM) with two specialized processing streams: the Haar wavelet decomposition stream (HWD-Stream) that conducts multi-resolution frequency-domain analysis to independently enhance low-frequency structural components and high-frequency textural information, and the Attention-guided Kolmogorov–Arnold Refinement Stream (AKR-Stream) that employs learnable spline-parameterized activation functions for adaptive nonlinear feature refinement. Second, we enhance the Cross-scale Channel Feature Fusion module by integrating an Adaptive Feature Fusion Module (AFAM) with complementary gating mechanisms that dynamically adjust modality contributions according to spatial informativeness. Third, we introduce the Bilinear Attention-Enhanced Detection Module (BADM) that models second-order feature interactions through factorized bilinear pooling, facilitating fine-grained crossmodal correlation analysis. Extensive experiments on the DroneVehicle benchmark show that MFE-DETR attains 78.6% mAP50 and 57.8% mAP50:95, outperforming state-of-the-art approaches by 5.3% and 3.7%, respectively. 
Additional evaluations on the VisDrone dataset further confirm the excellent generalization performance of our method, especially for small object detection with 18.6% APS, achieving a 1.5% improvement over existing techniques. Comprehensive ablation studies and visualizations offer detailed insights into the effectiveness of each proposed component.
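
Two mechanisms named in the abstract, complementary gating and factorized bilinear pooling, can be illustrated with a short NumPy sketch. This is a generic version of each technique under stated assumptions, not the AFAM or BADM design from the paper: the gate here is a single 1x1 projection of the concatenated channels, the bilinear form uses the common low-rank factorization z = Pᵀ((Uᵀx) ∘ (Vᵀy)), and all function and parameter names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_fusion(rgb, ir, w_gate, b_gate):
    """Fuse two (C, H, W) feature maps with complementary per-pixel gates.

    w_gate: (2C,) weights of an assumed 1x1 projection over the
            concatenated channels; b_gate: scalar bias.
    The gate g lies in (0, 1); RGB is weighted by g and IR by 1 - g,
    so the two modality weights always sum to one at each location.
    """
    stacked = np.concatenate([rgb, ir], axis=0)           # (2C, H, W)
    gate = sigmoid(np.tensordot(w_gate, stacked, axes=1) + b_gate)  # (H, W)
    return gate * rgb + (1.0 - gate) * ir                 # (C, H, W)

def factorized_bilinear(x, y, u, v, p):
    """Low-rank bilinear pooling of two feature vectors.

    x: (Dx,), y: (Dy,), u: (Dx, K), v: (Dy, K), p: (K, O) -> (O,)
    Equivalent to z_o = x^T W_o y with W_o = sum_k p[k, o] * u[:, k] v[:, k]^T,
    but without materializing the O full (Dx, Dy) weight matrices.
    """
    return p.T @ ((u.T @ x) * (v.T @ y))
```

The factorization is what keeps second-order interactions affordable: the parameter count grows with K(Dx + Dy + O) instead of O·Dx·Dy.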
