The rapid proliferation of unmanned aerial vehicles (UAVs) has amplified the need for robust and efficient object detection in diverse aerial environments. However, detecting small objects under complex conditions (e.g., low illumination, cluttered backgrounds, and thermal–visual discrepancies) remains challenging. While many existing detectors emphasize real-time inference, they often rely on weak or late fusion strategies, resulting in suboptimal utilization of complementary multi-modal cues. To address this limitation, we propose DGE-YOLO, an enhanced YOLO-based framework for effective infrared–visible (IR–RGB) multi-modal fusion in UAV object detection. DGE-YOLO adopts a dual-branch architecture for modality-specific feature extraction, preserving modality-aware representations before fusion. To strengthen cross-scale semantics, we introduce an Efficient Multi-scale Attention (EMA) module that improves feature discrimination across spatial resolutions. Furthermore, we replace the conventional neck with a Gather-and-Distribute module to reduce information loss during feature aggregation and improve multi-scale feature propagation. Extensive experiments on the DroneVehicle dataset demonstrate that DGE-YOLO consistently outperforms state-of-the-art baselines, confirming its effectiveness and practicality as an applied multi-modal detection solution for UAV scenarios.
Lv et al. studied this question.