Underwater Object Detection (UOD) in Synthetic Aperture Sonar (SAS) images is a core task of underwater intelligent perception systems. However, existing UOD methods suffer from excessive model redundancy, high computational cost, and severe noise-induced degradation of image quality. To mitigate these issues, this paper proposes an ultra-lightweight, high-precision underwater object detection method for SAS images. Built on a single-stage detection framework, the method introduces four efficient lightweight modules that target three key stages: feature extraction, feature fusion, and feature enhancement. For feature extraction, the Dilated-Attention Aggregation Feature Module (DAAFM) leverages a multi-scale dilated attention mechanism to strengthen the model's ability to perceive key information, thereby improving the expressiveness and spatial coverage of the extracted features. For feature fusion, the Channel–Spatial Parallel Attention with Gated Enhancement (CSPA-Gate) module integrates parallel channel–spatial modeling with gated enhancement to fuse multi-level semantic features effectively and respond dynamically to salient regions. For feature enhancement, the Spatial Gated Channel Attention Module (SGCAM) uses spatial gating to strengthen the model's ability to discriminate the importance of feature channels, thereby improving robustness to complex background interference. In addition, the Context-Aware Feature Enhancement Module (CAFEM) guides feature learning with contextual structural information, enhancing semantic consistency and feature stability from a global perspective. To address the limited number of real sonar images, a diffusion generative model is employed to synthesize a set of pseudo-sonar images, which are then combined with the real sonar dataset to construct an augmented training set.
A two-stage training strategy is proposed: the model is first trained on the real dataset and then fine-tuned on the synthetic dataset to enhance generalization and improve detection robustness. Results on the SCTD dataset show that the proposed method achieves higher precision than the baseline model with only 10% of its parameters. Notably, on a hybrid dataset, the proposed method surpasses Faster R-CNN by 10.3% in mAP50 while using only 9% of its parameters.
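The two-stage training strategy described above can be illustrated with a deliberately small sketch. It is not the paper's training pipeline: a one-dimensional linear model stands in for the detector, and the dataset sizes, noise levels, learning rates, and epoch counts are all assumptions chosen for illustration. The helper names (`make_samples`, `sgd_fit`, `mse`) are hypothetical. The only point shown is the ordering: stage 2 starts from the weights learned in stage 1 and uses a smaller learning rate.

```python
import random


def make_samples(n, noise, seed):
    """Toy data for y = 2x + 1 with Gaussian noise (illustrative stand-in for images)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = 2.0 * x + 1.0 + rng.gauss(0.0, noise)
        samples.append((x, y))
    return samples


def sgd_fit(weights, samples, lr, epochs):
    """Fit y ~ w*x + b by plain SGD, starting from the given (w, b)."""
    w, b = weights
    for _ in range(epochs):
        for x, y in samples:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return (w, b)


def mse(weights, samples):
    """Mean squared error of the linear model on a sample set."""
    w, b = weights
    return sum(((w * x + b) - y) ** 2 for x, y in samples) / len(samples)


# Assumed setup: a small "real" set and a larger, noisier "synthetic" set.
real = make_samples(20, noise=0.05, seed=0)
synthetic = make_samples(200, noise=0.2, seed=1)

init_weights = (0.0, 0.0)

# Stage 1: train from scratch on the real data.
stage1_weights = sgd_fit(init_weights, real, lr=0.1, epochs=50)

# Stage 2: fine-tune on the synthetic data, starting from the stage-1
# weights and using a smaller learning rate (an assumed, common choice).
stage2_weights = sgd_fit(stage1_weights, synthetic, lr=0.01, epochs=5)

print("stage 1 MSE on real data:", mse(stage1_weights, real))
print("stage 2 MSE on real data:", mse(stage2_weights, real))
```

In a real detector the same pattern would typically mean loading the stage-1 checkpoint before the second run rather than reinitializing the model.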
Xu et al. (Mon,) studied this question.