Multimodal object detection using RGB and infrared (IR) imagery has become a critical research area for unmanned aerial vehicle (UAV) surveillance, because it provides reliable perception under varying lighting and environmental conditions. Nevertheless, current methods face three primary challenges: (1) insufficient use of the frequency-domain properties of heterogeneous modalities, (2) limited adaptability of cross-modal feature integration across different environmental scenarios, and (3) inadequate modeling of fine-grained spatial relationships for accurate object localization. To overcome these limitations, we introduce MFE-DETR, a novel Multimodal Feature-Enhanced Detection Transformer that improves RGB-IR fusion through three complementary components. First, we present the Dual-Modality Enhancement Module (DMEM), which has two specialized processing streams: the Haar Wavelet Decomposition Stream (HWD-Stream), which performs multi-resolution frequency-domain analysis to enhance low-frequency structural components and high-frequency textural information independently, and the Attention-guided Kolmogorov–Arnold Refinement Stream (AKR-Stream), which employs learnable spline-parameterized activation functions for adaptive nonlinear feature refinement. Second, we enhance the Cross-scale Channel Feature Fusion module by integrating an Adaptive Feature Fusion Module (AFAM) whose complementary gating mechanisms dynamically adjust each modality's contribution according to spatial informativeness. Third, we introduce the Bilinear Attention-Enhanced Detection Module (BADM), which models second-order feature interactions through factorized bilinear pooling and thereby enables fine-grained cross-modal correlation analysis. Extensive experiments on the DroneVehicle benchmark show that MFE-DETR attains 78.6% mAP50 and 57.8% mAP50:95, outperforming state-of-the-art approaches by 5.3% and 3.7%, respectively. Additional evaluations on the VisDrone dataset further confirm that the method generalizes well, especially for small object detection, where it reaches 18.6% APS, a 1.5% improvement over existing techniques. Comprehensive ablation studies and visualizations offer detailed insights into the effectiveness of each proposed component.
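To make two of the mechanisms named above concrete, the following is a minimal, dependency-free sketch and not the authors' implementation. It assumes a single-level orthonormal 2D Haar transform on a plain nested list, with fixed scalar gains (`low_gain`, `high_gain`) standing in for whatever learned enhancement the HWD-Stream applies, and a per-position sigmoid gate standing in for the learned gating in AFAM. All function and parameter names are hypothetical, and a real model would operate on multi-channel tensors.

```python
import math


def haar_dwt2(x):
    """Single-level 2D orthonormal Haar transform of an even-sized 2D list.

    Returns (approx, d_w, d_h, d_d): the low-frequency approximation and the
    detail bands taken along the width, along the height, and diagonally.
    """
    h, w = len(x), len(x[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    bands = ([], [], [], [])
    for i in range(0, h, 2):
        rows = ([], [], [], [])
        for j in range(0, w, 2):
            a, b = x[i][j], x[i][j + 1]
            c, d = x[i + 1][j], x[i + 1][j + 1]
            rows[0].append((a + b + c + d) / 2)  # low-frequency structure
            rows[1].append((a - b + c - d) / 2)  # change along the width
            rows[2].append((a + b - c - d) / 2)  # change along the height
            rows[3].append((a - b - c + d) / 2)  # diagonal change
        for band, row in zip(bands, rows):
            band.append(row)
    return bands


def haar_idwt2(approx, d_w, d_h, d_d, low_gain=1.0, high_gain=1.0):
    """Invert haar_dwt2 after scaling the low and high bands independently.

    low_gain and high_gain are hypothetical fixed scalars (assumption); with
    both equal to 1.0 the original input is reconstructed.
    """
    out = []
    for i in range(len(approx)):
        top, bottom = [], []
        for j in range(len(approx[0])):
            s = low_gain * approx[i][j]
            p, q, r = (high_gain * band[i][j] for band in (d_w, d_h, d_d))
            top += [(s + p + q + r) / 2, (s - p + q - r) / 2]
            bottom += [(s + p - q - r) / 2, (s - p - q + r) / 2]
        out += [top, bottom]
    return out


def gated_fuse(rgb, ir, logits):
    """Complementary gating: g weights RGB and (1 - g) weights IR per position.

    logits is a hypothetical per-position score (assumption); in a trained
    model it would be predicted from the features themselves.
    """
    fused = []
    for rgb_row, ir_row, logit_row in zip(rgb, ir, logits):
        row = []
        for r, t, z in zip(rgb_row, ir_row, logit_row):
            g = 1.0 / (1.0 + math.exp(-z))  # sigmoid gate in (0, 1)
            row.append(g * r + (1.0 - g) * t)
        fused.append(row)
    return fused
```

Calling `haar_idwt2(*haar_dwt2(x), high_gain=1.5)` would amplify the detail bands of `x` while leaving its coarse structure unchanged, and `gated_fuse` leans toward the RGB value wherever the logit is positive and toward the IR value wherever it is negative. Because the two weights always sum to one, the gate redistributes emphasis between modalities rather than rescaling the fused feature.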
Yan et al. (Fri,) studied this question.