Unmanned aerial vehicle (UAV) red–green–blue–infrared (RGB-IR) object detection is important for traffic monitoring, security surveillance, and urban management, but remains challenging because aerial targets are often small, densely distributed, and affected by complex backgrounds. In addition, the RGB and infrared (IR) modalities contribute unequally under different imaging conditions, so simple feature concatenation or indiscriminate middle-layer fusion is insufficient for stable use of cross-modal information. To address this problem, this paper proposes Selective Interaction Mechanism and Prefiltering Complementary Spatial Refinement (SIM-PCSR), a key-layer complementary enhancement method for UAV RGB-IR small-object detection. The proposed method decomposes cross-modal modeling into two stages. SIMAdapter first performs selective interaction on the small-object-sensitive P3 layer before fusion, suppressing redundant responses and enhancing potentially complementary modal evidence. PCSR then refines the fused representation through prefiltering, modal selection, and local window residual refinement, injecting reliable complementary information into the key-layer fused feature in a controlled manner. Experiments on the DroneVehicle dataset show that SIM-PCSR achieves a mean average precision (mAP) of 85.323 for mAP50 and 63.572 for mAP50:95, improving on the Fixed Middle Fusion baseline by 0.523 and 0.751 percentage points, respectively. These gains correspond to relative improvements of 0.62% and 1.20% over the baseline. Module ablation, position ablation, repeated-seed evaluation, category-wise analysis, scale-wise analysis, and qualitative visualization jointly demonstrate that explicit selection and organization of cross-modal information can improve UAV RGB-IR small-object detection under modality imbalance and background interference.
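The relation between the absolute and relative gains quoted above can be checked with a short calculation. The sketch below is illustrative only: the baseline scores are not stated in the text and are derived here by subtracting each reported gain from the corresponding reported score, and the helper name `relative_gain` is hypothetical.

```python
def relative_gain(score, abs_gain):
    """Return (implied baseline, relative gain in percent).

    Assumption: the baseline equals the reported score minus the
    reported absolute gain in percentage points.
    """
    baseline = score - abs_gain
    return baseline, 100.0 * abs_gain / baseline


# Reported SIM-PCSR scores and absolute gains over Fixed Middle Fusion.
map50_base, map50_rel = relative_gain(85.323, 0.523)
map5095_base, map5095_rel = relative_gain(63.572, 0.751)

print(f"mAP50:    implied baseline {map50_base:.3f}, relative gain {map50_rel:.2f}%")
print(f"mAP50:95: implied baseline {map5095_base:.3f}, relative gain {map5095_rel:.2f}%")
```

Under this assumption the implied baseline scores are 84.800 mAP50 and 62.821 mAP50:95, and the computed relative gains round to 0.62% and 1.20%, consistent with the figures reported above.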
He et al. (Mon,) studied this question.