Object detection in UAV remote sensing imagery faces three primary challenges: severe scale variation, densely clustered small targets, and constrained computational resources. This work introduces a family of lightweight detection models designed according to a “Capacity-Aware Configuration Regularity” and incorporating a Feature-Refinement C2f module to improve representational efficiency. A dynamic coupling between detection head capacity and the representational quality of backbone features is identified and validated through systematic ablation studies at three parameter magnitudes. On the VisDrone2019 benchmark, the proposed model family scales from 1.67 M to 6.15 M parameters. The nano variant achieves 31.7% mAP50 with only 55% of the parameters of YOLOv8n, surpassing it by 0.7 percentage points. The small variant, with a parameter budget comparable to YOLOv8n, attains 36.7% mAP50, exceeding it by 5.7 points. The medium variant reaches 43.1% mAP50 with 58% of the parameters of YOLOv8s, outperforming it by 4.1 points. The gains hold under the stricter mAP50–95 metric, where the small variant outperforms YOLOv8n by 3.3 points and the medium variant surpasses YOLOv8s by 2.8 points, indicating robust localization accuracy across a wide range of IoU thresholds. The same advantage in the accuracy–efficiency trade-off carries over to the DIOR dataset, supporting the generalization of the proposed models across diverse remote sensing scenarios. Moreover, the capacity-matching regularity offers transferable methodological guidance for designing lightweight detection models for resource-constrained platforms.
Yin et al. (Sat,) studied this question.
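The mAP50 figures in the abstract can be cross-checked with simple arithmetic: subtracting each stated margin from the reported score should give the same baseline value whenever two variants are compared against the same model. The sketch below is only an illustrative consistency check. The dictionary layout and the helper name `implied_baseline` are assumptions introduced here, not part of the original work.

```python
# Reported mAP50 (%) on VisDrone2019 and the stated margin over each baseline,
# copied from the abstract. The data layout is a hypothetical convenience.
reported = {
    "nano":   {"map50": 31.7, "margin": 0.7, "baseline": "YOLOv8n"},
    "small":  {"map50": 36.7, "margin": 5.7, "baseline": "YOLOv8n"},
    "medium": {"map50": 43.1, "margin": 4.1, "baseline": "YOLOv8s"},
}


def implied_baseline(map50: float, margin: float) -> float:
    """Baseline mAP50 implied by a reported score and its margin (hypothetical helper)."""
    # Round to one decimal to match the precision used in the abstract.
    return round(map50 - margin, 1)


# Collect every baseline value implied by the abstract, grouped by baseline model.
implied: dict[str, set[float]] = {}
for row in reported.values():
    implied.setdefault(row["baseline"], set()).add(
        implied_baseline(row["map50"], row["margin"])
    )

for model, values in implied.items():
    # A single value per baseline means the reported margins agree with each other.
    print(model, sorted(values))
```

Both comparisons against YOLOv8n imply the same baseline of 31.0% mAP50, and the comparison against YOLOv8s implies 39.0%, so the reported margins are internally consistent.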