In oriented object detection from drone imagery, many existing RGB-infrared (RGB-IR) fusion methods derive modality weights from input statistics alone, without regard for downstream detection objectives. We present SGFNet, a Semantic-Guided Fusion Network that feeds detection-level semantics back into the fusion stage through learned importance masks. SGFNet comprises three modules: (1) a Frequency-aware Disentanglement Module (FDM) that separates high-frequency textures from low-frequency thermal structures through Laplacian and Gaussian filtering; (2) a Semantic-Guided Module (SGM) that generates P5-level semantic masks to steer fusion toward detection-critical regions; and (3) an Adaptive Geometric Convolution (AGC) whose rotation-aware sampling matches receptive fields to arbitrarily oriented objects. On the DroneVehicle benchmark (28,439 RGB-IR pairs, five vehicle categories), SGFNet achieves 82.0% mAP@0.5, surpassing the runner-up DMM by 3.2 percentage points while lowering mean angular error from 7.4° to 6.2° (−16%). Ablation analysis attributes the largest single-module gain (+1.7 pp) to the semantic feedback path.
Zhang et al. (Sat,) studied this question.
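
To make the frequency split behind the FDM concrete, the following is a minimal sketch of the general idea only: a Laplacian kernel as a high-pass filter for texture and a Gaussian kernel as a low-pass filter for smooth thermal structure. It is not SGFNet's implementation. The function names (`gaussian_kernel`, `filter2d`, `disentangle`), the 5×5 kernel size, `sigma=1.0`, the replicate padding, and the use of single-channel 2D inputs are all assumptions made for illustration; the abstract does not specify these details or state which feature maps the filters act on.

```python
import math

def gaussian_kernel(size=5, sigma=1.0):
    """Return a normalized size x size Gaussian kernel (low-pass).
    Size and sigma are illustrative assumptions, not values from the paper."""
    half = size // 2
    k = [[math.exp(-(x * x + y * y) / (2 * sigma * sigma))
          for x in range(-half, half + 1)]
         for y in range(-half, half + 1)]
    total = sum(map(sum, k))
    return [[v / total for v in row] for row in k]

# 4-neighbour discrete Laplacian (high-pass); its coefficients sum to zero,
# so flat regions map to zero and only intensity changes survive.
LAPLACIAN = [[0.0, 1.0, 0.0],
             [1.0, -4.0, 1.0],
             [0.0, 1.0, 0.0]]

def filter2d(img, kernel):
    """Correlate a 2D list of floats with a kernel, replicating border pixels."""
    h, w = len(img), len(img[0])
    kh, kw = len(kernel), len(kernel[0])
    oy, ox = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for a in range(kh):
                for b in range(kw):
                    # Clamp coordinates so out-of-range taps reuse the edge pixel.
                    y = min(max(i + a - oy, 0), h - 1)
                    x = min(max(j + b - ox, 0), w - 1)
                    acc += kernel[a][b] * img[y][x]
            out[i][j] = acc
    return out

def disentangle(rgb_gray, ir):
    """Hypothetical split: high-frequency texture from the RGB branch,
    low-frequency structure from the IR branch."""
    high = filter2d(rgb_gray, LAPLACIAN)
    low = filter2d(ir, gaussian_kernel())
    return high, low

if __name__ == "__main__":
    # A vertical step edge: the Laplacian responds only next to the edge.
    edge = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(6)]
    high, low = disentangle(edge, edge)
    print(high[2])
    print([round(v, 3) for v in low[2]])
```

In this sketch the high-pass output is zero wherever the input is flat, while the low-pass output preserves the overall intensity level and smooths the edge. A learned module would presumably operate on multi-channel feature maps rather than raw single-channel images.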