In remote sensing imagery, objects smaller than 32×32 pixels suffer from three persistent challenges that existing detectors inadequately resolve: (1) their weak signal is easily submerged in background clutter, causing high miss rates; (2) the scarcity of valid pixels yields few geometric or textural cues, hindering discriminative feature extraction; and (3) successive down-sampling irreversibly discards high-frequency details, while multi-scale pyramids still fail to compensate. To counteract these issues, we propose MS-YOLOv11, an enhanced YOLOv11 variant that integrates “frequency-domain detail preservation, lightweight receptive-field expansion, and adaptive cross-scale fusion.” Specifically, a 2D Haar wavelet first decomposes the image into multiple frequency sub-bands to explicitly isolate and retain high-frequency edges and textures while suppressing noise. Each sub-band is then processed independently by small-kernel depthwise convolutions that enlarge the receptive field without over-smoothing. Finally, the Mix Structure Block (MSB) employs the MSPLCK module to perform densely sampled multi-scale atrous convolutions for rich context of diminutive objects, followed by the EPA module that adaptively fuses and re-weights features via residual connections to suppress background interference. Extensive experiments on DOTA and DIOR demonstrate that MS-YOLOv11 surpasses the baseline in mAP@50, mAP@95, parameter efficiency, and inference speed, validating its targeted efficacy for small-object detection.
Liu et al. (Mon,) studied this question.
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: