The proliferation of unmanned aerial vehicles (UAVs) poses escalating security threats to critical infrastructure, necessitating robust real-time detection systems. Existing vision-based methods predominantly rely on single-modality data and degrade significantly in challenging scenarios. To address these limitations, we propose DCAM-DETR, a novel multimodal detection framework that fuses RGB and thermal infrared modalities through an enhanced RT-DETR architecture integrated with state space models. Our approach introduces four innovations: (1) a MobileMamba backbone that leverages selective state space models for efficient long-range dependency modeling with linear complexity O(n); (2) Cross-Dimensional Attention (CDA) and Cross-Path Attention (CPA) modules that capture intermodal correlations across spatial and channel dimensions; (3) an Adaptive Feature Fusion Module (AFFM) that dynamically calibrates multimodal feature contributions; and (4) a Dual-Attention Decoupling Module (DADM) that enhances detection head discrimination for small targets. Experiments on Anti-UAV300 demonstrate state-of-the-art performance, with 94.7% mAP@0.5 and 78.3% mAP@0.5:0.95 at 42 FPS. Extended evaluations on the FLIR-ADAS and KAIST datasets validate the framework's generalization across diverse scenarios.
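To make the idea of dynamically calibrating multimodal feature contributions concrete, the following is a minimal sketch of per-channel gated fusion of an RGB and a thermal feature map. It is not the AFFM described above, whose internals are not given here: the function names `channel_means` and `adaptive_fuse`, the `temperature` parameter, and the parameter-free gate (a softmax over each channel's mean activation in the two modalities) are all assumptions chosen for illustration. A learned gating network would normally replace the fixed rule.

```python
import math

def channel_means(feat):
    # feat: list of C channels, each a list of H rows of W floats.
    # Global average pooling: one scalar descriptor per channel.
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]

def adaptive_fuse(rgb, ir, temperature=1.0):
    """Hypothetical gated fusion of two (C, H, W) feature maps.

    For each channel, a two-way softmax over the modality descriptors
    yields gates that sum to 1; the fused map is the gated sum.
    This is an illustrative stand-in, not the paper's AFFM.
    """
    gates = []
    for m_rgb, m_ir in zip(channel_means(rgb), channel_means(ir)):
        top = max(m_rgb, m_ir)  # subtract the max for numerical stability
        e_rgb = math.exp((m_rgb - top) / temperature)
        e_ir = math.exp((m_ir - top) / temperature)
        total = e_rgb + e_ir
        gates.append((e_rgb / total, e_ir / total))
    fused = [
        [[g_rgb * a + g_ir * b for a, b in zip(row_rgb, row_ir)]
         for row_rgb, row_ir in zip(ch_rgb, ch_ir)]
        for (g_rgb, g_ir), ch_rgb, ch_ir in zip(gates, rgb, ir)
    ]
    return fused, gates
```

In this sketch, a channel whose thermal response is stronger than its RGB response receives a larger thermal gate, so the fused feature leans toward the more informative modality for that channel; equal responses give equal weights of 0.5.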
Qin et al. (Mon,) studied this question.