Audio watermarking plays a crucial role in digital copyright protection and content authentication. In recent years, audio watermarking methods based on deep neural networks have attracted significant attention. These methods typically consist of an encoder, a distortion simulation layer, and a decoder, which together enable end-to-end training of watermark embedding and extraction. However, existing encoder designs remain limited in two respects: the watermark and the audio features are insufficiently fused, and the ability to model both fine spectral detail and overall structure is restricted. Both limitations degrade the imperceptibility and robustness of the watermark. To address these issues, this paper proposes a robust audio watermarking method based on a dual-encoder U-Net and the Short-Time Fourier Transform (STFT). The proposed framework comprises a watermark embedding network and a watermark extraction network. The embedding network consists of a dual-encoder U-Net and a multi-scale feature fusion module; it extracts and integrates features from the audio amplitude spectrogram and the watermark sequence, and embeds the watermark into different spectral regions to enhance imperceptibility. The extraction network introduces a multi-scale fusion module that integrates local and global features through parallel convolutional paths with different receptive fields, significantly improving watermark extraction performance. Experimental results on three public datasets show that the proposed method achieves good imperceptibility compared with other methods and is robust against multiple attacks, with watermark extraction accuracy approaching 100% under most attacks.
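The idea of parallel convolutional paths with different receptive fields can be illustrated with a minimal sketch. This is not the network proposed in the paper: the function names `conv1d_same` and `multi_scale_fusion`, the kernel sizes (3, 5, 7), the fixed box kernels standing in for learned weights, and the equal-weight averaging used for fusion are all assumptions made purely for illustration.

```python
from typing import List, Sequence, Tuple


def conv1d_same(x: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """1-D cross-correlation with zero padding; output length equals input length.

    Assumes an odd kernel length so the padding is symmetric.
    """
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(padded[i + j] * kernel[j] for j in range(k))
        for i in range(len(x))
    ]


def multi_scale_fusion(
    x: Sequence[float], kernel_sizes: Tuple[int, ...] = (3, 5, 7)
) -> Tuple[List[List[float]], List[float]]:
    """Run parallel paths with different receptive fields, then fuse them.

    Hypothetical stand-ins: each path uses a box (mean) kernel instead of
    learned weights, and fusion is an equal-weight average across paths
    (comparable to a 1x1 convolution with fixed, equal weights).
    """
    # Small kernels respond to local detail, large kernels to wider context.
    paths = [conv1d_same(x, [1.0 / k] * k) for k in kernel_sizes]
    # Fuse position by position across the parallel paths.
    fused = [sum(values) / len(paths) for values in zip(*paths)]
    return paths, fused


if __name__ == "__main__":
    # A unit impulse shows how far each path spreads a single feature.
    impulse = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
    paths, fused = multi_scale_fusion(impulse)
    for size, path in zip((3, 5, 7), paths):
        print(size, [round(v, 3) for v in path])
    print("fused", [round(v, 3) for v in fused])
```

With a unit impulse as input, the size-3 path responds only at the three central positions while the size-7 path responds across the whole window, so the fused output combines a sharp local response with a broad contextual one. In a trained model the kernels and fusion weights would be learned rather than fixed.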
Wen et al. studied this question.