What question did this study set out to answer?

This research aims to enhance the detection of densely clustered small objects in UAV imagery using a novel lightweight framework.

June 20, 2026 | Open Access

DSPE-ViT: a lightweight vision transformer with dynamic sparse positional encoding for dense small object detection in UAV imagery

Key Points

Developed the DSPE-ViT framework with a PE Redundancy Pruner and Local PE Enhancer.
Integrated a Small Object Feature Pyramid Network (SmallObjFPN) for better multi-scale feature representation.
Applied the WIoU v3 loss for improved bounding-box regression of small targets.
Achieved 43.2% mAP@0.5 on the VisDrone2019-DET dataset with approximately 6.0 M parameters.
On the SeaDronesSee dataset, obtained 30.1% mAP@0.5 under zero-shot transfer and 38.4% after fine-tuning.
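
The mAP@0.5 figures above count a predicted box as correct when its intersection over union (IoU) with a ground-truth box of the same class reaches 0.5. The sketch below is not taken from the paper; it only illustrates that matching criterion and why it is demanding for small targets. The function names `iou` and `is_true_positive` and the example boxes are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.5):
    """Matching rule behind mAP@0.5: IoU must reach the threshold."""
    return iou(pred_box, gt_box) >= threshold


# Hypothetical 10x10-pixel ground-truth object.
gt = (0, 0, 10, 10)
shift_3px = (3, 0, 13, 10)  # prediction offset by 3 pixels
shift_4px = (4, 0, 14, 10)  # prediction offset by 4 pixels

print(is_true_positive(shift_3px, gt))  # IoU = 70 / 130, above 0.5
print(is_true_positive(shift_4px, gt))  # IoU = 60 / 140, below 0.5
```

For a 10-pixel-wide object, one extra pixel of localization error is enough to turn a match into a miss, which is one reason bounding-box regression quality matters so much for small targets.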

Abstract

Background: Detecting densely distributed small objects in unmanned aerial vehicle (UAV) aerial imagery poses a persistent challenge in computer vision. Vision Transformers (ViTs), empowered by global self-attention, perform strongly in object detection, but their fixed absolute positional encoding (PE) adapts poorly to scenes where small targets cluster at high density, and redundant encoding dimensions introduce unnecessary computational overhead.

Methods: This paper presents DSPE-ViT, a lightweight ViT-based detection framework tailored for dense small object detection in UAV imagery. Its core DSPE module comprises two complementary components: a PE Redundancy Pruner that employs learnable soft-gating masks to adaptively suppress redundant PE dimensions, and a Local PE Enhancer that introduces density-aware adaptive-window relative positional encoding to strengthen local spatial perception in high-density regions. Beyond the DSPE module, a Small Object Feature Pyramid Network (SmallObjFPN) integrating SE channel attention with depthwise separable convolutions improves multi-scale feature representation, and the WIoU v3 loss is adopted to refine bounding-box regression for small targets.

Results: On the VisDrone2019-DET dataset, DSPE-ViT achieves 43.2% mAP@0.5 with only approximately 6.0 M parameters and 15.8 GFLOPs. Cross-domain evaluation on SeaDronesSee yields 30.1% mAP@0.5 under zero-shot transfer and 38.4% after fine-tuning.

Conclusion: The cross-domain results confirm the generalization capability of the proposed lightweight framework.
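
The abstract describes the PE Redundancy Pruner only at a high level, as learnable soft-gating masks over positional-encoding dimensions. The following is a minimal sketch of that general idea, not the authors' implementation: it assumes one sigmoid gate per PE dimension and uses the standard fixed sinusoidal encoding as the input. The names `soft_gate`, `sinusoidal_pe`, the `temperature` parameter, and the hand-set logits are all hypothetical; in a real model the logits would be trained parameters.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sinusoidal_pe(position, dim):
    """Standard fixed absolute sinusoidal positional encoding for one position."""
    pe = []
    for i in range(dim):
        angle = position / (10000 ** (2 * (i // 2) / dim))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe


def soft_gate(pe, gate_logits, temperature=1.0):
    """Scale each PE dimension by a gate in (0, 1).

    Assumption: one logit per dimension, squashed by a sigmoid. A strongly
    negative logit drives its dimension toward zero (soft pruning).
    """
    if len(pe) != len(gate_logits):
        raise ValueError("pe and gate_logits must have the same length")
    gates = [sigmoid(g / temperature) for g in gate_logits]
    gated = [p * g for p, g in zip(pe, gates)]
    return gated, gates


# Hypothetical logits: keep the first four dimensions, suppress the last four.
pe = sinusoidal_pe(position=3, dim=8)
logits = [4.0, 4.0, 4.0, 4.0, -4.0, -4.0, -4.0, -4.0]
gated_pe, gates = soft_gate(pe, logits)

for value, gate, out in zip(pe, gates, gated_pe):
    print(f"pe={value:+.3f}  gate={gate:.3f}  gated={out:+.3f}")
```

Because the gates are continuous rather than binary, a mask of this kind stays differentiable during training; dimensions whose gates settle near zero are the candidates for removal.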

Cai et al. studied this question.

synapsesocial.com/papers/6a362d32db0793dc1a535a33 https://doi.org/10.3389/fnbot.2026.1849093