Small object detection in UAV remote sensing imagery remains challenging. Existing Transformer-based detectors still suffer from feature degradation and insufficient multi-scale information fusion when handling small objects that occupy few pixels against complex backgrounds. To address this, we propose MSF-DETR, a Transformer-based detector with multi-scale perception and cross-spatial-frequency domain fusion. Specifically, we design a multi-scale perception attention feature extraction network that integrates a Poly Kernel Inception module with a bidirectional contextual anchor attention mechanism via a dual-pathway fusion block, enabling simultaneous capture of multi-granularity features and long-range semantic dependencies. We further develop a feature alignment and cross-spatial-frequency enhancement pyramid that enriches shallow-layer spatial details through feature reorganization and uses a spatial-frequency dual-domain collaborative strategy to capture both local textures and global spectral dependencies. Cross-scale dynamic intensity modulation, combined with decoupled lightweight downsampling, suppresses semantic noise, corrects feature misalignment, and preserves critical edge details. Finally, a Shape-NWD loss is devised to incorporate geometric and scale constraints, alleviating the positional sensitivity of IoU for small targets. Extensive experiments on three public benchmarks demonstrate the superior performance of MSF-DETR; notably, on the VisDrone dataset, it improves mAP50 and mAP50:95 by 7.45% and 8.71%, respectively, over the baseline.
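To illustrate the positional sensitivity of IoU that motivates the loss, the following minimal sketch compares IoU with the standard normalized Wasserstein distance (NWD), in which each box is modeled as a 2D Gaussian. This is not the Shape-NWD loss itself: the abstract does not specify its geometric and scale terms, so they are omitted here. The function names, the `(cx, cy, w, h)` box format, and the normalizing constant `c=12.8` are assumptions chosen for illustration.

```python
import math


def iou(a, b):
    """IoU of two axis-aligned boxes given as (cx, cy, w, h)."""
    ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2
    ax2, ay2 = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2
    bx2, by2 = b[0] + b[2] / 2, b[1] + b[3] / 2
    # Overlap width and height, clamped at zero when the boxes are disjoint.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def nwd(a, b, c=12.8):
    """Standard NWD between two boxes modeled as 2D Gaussians.

    `c` is a dataset-dependent normalizer; 12.8 is an illustrative
    assumption, not a value taken from the abstract.
    """
    # 2-Wasserstein distance between N(center, diag(w/2, h/2)^2) Gaussians.
    w2 = math.sqrt(
        (a[0] - b[0]) ** 2
        + (a[1] - b[1]) ** 2
        + ((a[2] - b[2]) / 2) ** 2
        + ((a[3] - b[3]) / 2) ** 2
    )
    return math.exp(-w2 / c)


def nwd_loss(a, b, c=12.8):
    """Plain NWD loss (1 - NWD); the shape terms of Shape-NWD are not modeled."""
    return 1.0 - nwd(a, b, c)


# The same 4-pixel shift applied to a tiny box and to a large box.
small_gt, small_pred = (10.0, 10.0, 4.0, 4.0), (14.0, 10.0, 4.0, 4.0)
large_gt, large_pred = (100.0, 100.0, 64.0, 64.0), (104.0, 100.0, 64.0, 64.0)

for name, gt, pred in [("small", small_gt, small_pred), ("large", large_gt, large_pred)]:
    print(f"{name}: IoU={iou(gt, pred):.3f}  NWD={nwd(gt, pred):.3f}")
```

Under these assumptions, the 4-pixel shift drives the IoU of the 4×4 box to zero while the 64×64 box keeps a high IoU, whereas NWD assigns both pairs the same nonzero similarity and therefore still provides a usable training signal for the small target.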
Shi et al. studied this question.