Predicting ocean sound speed fields (SSFs) is critical for underwater communication, marine resource exploration, and environmental monitoring. Owing to its strong generalization ability, deep learning has shown clear advantages in SSF prediction. However, because high-dimensional data are difficult to process, existing studies extract only three-dimensional features and do not capture the complete spatiotemporal information of the SSF. In this work, we propose the Swin Transformer-UNet model (ST-UNet), which combines the convolutional U-Net with the Swin Transformer, for four-dimensional SSF prediction. In this model, the Swin Transformer extracts spatiotemporal features through its multi-head self-attention mechanism, while the U-Net enhances spatial detail through convolutional feature recovery. The feasibility and accuracy of the model are demonstrated on a real-world dataset from the South China Sea. The model achieves a root mean square error (RMSE) of 0.783 m/s for 24-h SSF prediction based on 7 days of historical data, outperforming the baseline architectures by 33%–72%.
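To make the prediction task and its error metric concrete, the following is a minimal NumPy sketch, not the authors' pipeline. It assumes the SSF is stored as an array with axes (time, depth, latitude, longitude) and, purely for illustration, one snapshot per day, so that 7 history steps map to 1 target step. The names `make_windows` and `rmse`, the grid size, and the synthetic data are all hypothetical.

```python
import numpy as np


def make_windows(ssf, history_steps, horizon_steps):
    """Slice a (time, depth, lat, lon) array into (input, target) pairs.

    Each input holds `history_steps` consecutive snapshots and each target
    holds the `horizon_steps` snapshots that immediately follow it.
    """
    n_samples = ssf.shape[0] - history_steps - horizon_steps + 1
    if n_samples <= 0:
        raise ValueError("time axis too short for the requested window sizes")
    inputs = np.stack([ssf[i:i + history_steps] for i in range(n_samples)])
    targets = np.stack([
        ssf[i + history_steps:i + history_steps + horizon_steps]
        for i in range(n_samples)
    ])
    return inputs, targets


def rmse(pred, truth):
    """Root mean square error over every grid point and time step (m/s)."""
    return float(np.sqrt(np.mean((pred - truth) ** 2)))


# Synthetic stand-in for real data: 30 daily snapshots on a 4 x 6 x 6 grid,
# centred near a typical seawater sound speed of about 1500 m/s.
rng = np.random.default_rng(0)
ssf = 1500.0 + rng.normal(0.0, 5.0, size=(30, 4, 6, 6))

# Assumed sampling: 7 daily snapshots in, 1 snapshot (24 h ahead) out.
x, y = make_windows(ssf, history_steps=7, horizon_steps=1)

# Persistence baseline: predict that the field stays at its last observed state.
persistence = x[:, -1:, ...]
print(x.shape, y.shape, rmse(persistence, y))
```

A trained model would replace the persistence baseline here. Scoring every candidate with the same `rmse` on the same windows is one way to obtain the kind of relative comparison quoted above.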
Li et al. (Sun,) studied this question.