Background
Detecting densely distributed small objects in unmanned aerial vehicle (UAV) imagery remains a persistent challenge in computer vision. Vision Transformers (ViTs), which rely on global self-attention, perform strongly in object detection, but their fixed absolute positional encoding (PE) adapts poorly to scenes where small targets cluster at high density, and redundant encoding dimensions add unnecessary computational overhead.

Methods
This paper presents DSPE-ViT, a lightweight ViT-based detection framework tailored to dense small object detection in UAV imagery. Its core DSPE module comprises two complementary components: a PE Redundancy Pruner, which uses learnable soft-gating masks to adaptively suppress redundant PE dimensions, and a Local PE Enhancer, which introduces a density-aware, adaptive-window relative positional encoding to strengthen local spatial perception in high-density regions. Beyond the DSPE module, a Small Object Feature Pyramid Network (SmallObjFPN) that combines squeeze-and-excitation (SE) channel attention with depthwise separable convolutions improves multi-scale feature representation, and the WIoU v3 loss is adopted to refine bounding-box regression for small targets.

Results
On the VisDrone2019-DET dataset, DSPE-ViT achieves 43.2% mAP@0.5 with approximately 6.0 M parameters and 15.8 GFLOPs. Cross-domain evaluation on SeaDronesSee yields 30.1% mAP@0.5 under zero-shot transfer and 38.4% mAP@0.5 after fine-tuning.

Conclusion
The cross-domain results support the generalization capability of the proposed lightweight framework.
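
To illustrate the soft-gating idea behind the PE Redundancy Pruner, the following is a minimal sketch and not the authors' implementation. It assumes that each PE dimension has one learnable logit, that the gate is the sigmoid of that logit divided by a temperature, and that a dimension counts as retained when its gate is at least 0.5. The names `soft_gate`, `kept_dimensions`, `gate_logits`, `temperature`, and `threshold` are hypothetical and chosen only for this example.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a real logit to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def soft_gate(pe, gate_logits, temperature=1.0):
    """Scale every PE dimension by a soft gate in (0, 1).

    pe          : list of token encodings, each a list of floats
    gate_logits : one logit per PE dimension (learned in training;
                  fixed by hand here for illustration)
    temperature : assumed sharpening factor; smaller values push
                  gates closer to 0 or 1
    """
    if any(len(row) != len(gate_logits) for row in pe):
        raise ValueError("each PE row needs one gate logit per dimension")
    gates = [sigmoid(g / temperature) for g in gate_logits]
    gated = [[v * g for v, g in zip(row, gates)] for row in pe]
    return gated, gates


def kept_dimensions(gates, threshold=0.5):
    """Indices of dimensions whose gate reaches the assumed threshold."""
    return [i for i, g in enumerate(gates) if g >= threshold]


# Two tokens with a 4-dimensional positional encoding (toy values).
pe = [[1.0, 2.0, 3.0, 4.0],
      [0.5, -1.0, 2.0, -2.0]]
# Hand-picked logits: dimension 1 is strongly suppressed.
gate_logits = [4.0, -4.0, 0.0, 6.0]

gated, gates = soft_gate(pe, gate_logits)
kept = kept_dimensions(gates)
print("gates:", [round(g, 3) for g in gates])
print("kept dimensions:", kept)
```

Because the gate is continuous, gradients can flow to the logits during training, so the model can learn which dimensions to suppress; dimensions whose gates stay near zero could then be dropped at inference to reduce computation. How DSPE-ViT parameterizes, regularizes, or thresholds its masks is not specified in this abstract.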
Cai et al. studied this question.