Multimodal perception and fusion play a vital role in unmanned aerial vehicle (UAV) object detection. Existing methods typically adopt global fusion strategies across modalities. However, due to illumination variation, the effectiveness of RGB and infrared modalities may differ across local regions within the same image, particularly in UAV perspectives where occlusions and dense small objects are prevalent, leading to suboptimal performance of global fusion methods. To address this issue, we propose an adaptive fine-grained fusion network for multimodal UAV object detection. First, we design a local feature consistency-based modality fusion module, which adaptively assigns local fusion weights according to the structural consistency of high-response regions across modalities, thereby enabling more effective aggregation of object-relevant features. Second, we introduce a mutual information-guided feature contrastive loss to encourage the preservation of modality-specific information during the early training phase. Experimental results demonstrate that the proposed method effectively addresses the issue of object occlusion in UAV perspectives, achieving state-of-the-art performance on multimodal UAV object detection benchmarks. Code will be available at https://github.com/lingf5877/AFFNet.
Tang et al. (Thu,) studied this question.