Object detection in aerial imagery faces extreme target sparsity and high-intensity environmental interference, causing weak targets to be submerged in background clutter. To address this, we propose a Collaborative Feature Purification Detection Transformer (CFP-DETR), which reconstructs discriminative target representations through a collaborative feature purification mechanism. Specifically, the Global Context Denoising Module (GCDM) first suppresses environmental noise at the semantic level to enhance target saliency. The purified features are then fused across scales by an Adaptive Cross-scale Feature Alignment (ACFA) module, which resolves the spatial misalignment that otherwise dilutes small-object features during multi-level interaction. Concurrently, a Fine-Grained Detail Injection Module (FGDIM) recovers shallow high-resolution details and injects them into the semantic flow, compensating for the information lost through progressive downsampling. Together, these modules denoise, align, and recover features, counteracting target submergence at different stages of the pipeline. In addition, a lightweight variant, Efficient Lightweight CFP-DETR (EL-CFP-DETR), rebuilds the backbone with partial convolution and structural re-parameterization to improve efficiency while maintaining competitive detection accuracy. Extensive experiments on five datasets validate the effectiveness of this collaborative design. On the SeaDronesSee dataset, CFP-DETR increases AP50 and APSval by 1.64% and 4.03% over the baseline Real-Time Detection Transformer (RT-DETR), while EL-CFP-DETR reduces parameters by 18% to 16.4M and GFLOPs by 15% to 48.3, reaching 42.8 FPS. Notably, CFP-DETR achieves an inference speed of 37.72 FPS, a 31.2% improvement over RT-DETR.
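The efficiency figures above are stated as a relative change plus a resulting value, so the reference values they imply can be recovered with simple arithmetic. The following is a minimal sketch, assuming each percentage is a relative change with respect to a single reference model (presumably the RT-DETR baseline) and that the quoted percentages are rounded. The helper name `implied_baseline` is hypothetical, and the derived numbers are back-of-the-envelope estimates, not results reported in the abstract.

```python
def implied_baseline(new_value, relative_change):
    """Invert new_value = baseline * (1 + relative_change) to estimate the baseline."""
    return new_value / (1.0 + relative_change)


# (reported value, relative change) pairs quoted in the abstract.
# Assumption: reductions are negative changes, improvements are positive changes.
reported = {
    "EL-CFP-DETR params (M)": (16.4, -0.18),
    "EL-CFP-DETR GFLOPs": (48.3, -0.15),
    "CFP-DETR FPS": (37.72, +0.312),
}

# Estimated reference value for each metric (derived, not reported).
implied = {
    name: implied_baseline(value, change)
    for name, (value, change) in reported.items()
}

for name, base in implied.items():
    print(f"{name}: implied reference value ~ {base:.2f}")
```

Under these assumptions, the implied reference model has roughly 20.0M parameters, about 56.8 GFLOPs, and runs at about 28.75 FPS. Because the percentages are rounded, these values are approximate.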
Wang et al. (Sat,) studied this question.