Deep neural network-based speaker identification systems are vulnerable to adversarial attacks. However, the distortions introduced by existing adversarial examples remain clearly perceptible in most cases. In this work, we therefore propose a universal perturbation-based adaptive network (UPAN) to generate high-quality adversarial examples. Specifically, UPAN first uses a universal perturbation generative module to produce an input-independent perturbation, which is then fed into an adaptation module that produces an input-specific perturbation. The sum of this input-specific perturbation and the input audio is then used as the adversarial example to attack the speaker identification system. To further improve speech quality, we introduce an improved perceptual loss that combines the mean square error and the frame-wise cosine similarity between the MFCC features of the input audio and those of the adversarial example. Experimental results on the VoxCeleb1 dataset demonstrate that the proposed approach is effective for both non-targeted and targeted attacks.
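The abstract does not give the exact form of the perceptual loss or of the perturbation step, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes the MFCC features are already computed and passed in as lists of frames, that the two terms are combined as a weighted sum of the mean square error and one minus the mean frame-wise cosine similarity, and that the perturbation is bounded before it is added. The names `add_perturbation` and `perceptual_loss` and the parameters `eps`, `alpha`, `beta`, and `tiny` are all hypothetical. Plain Python is used to keep the sketch dependency-free; a practical version would use a tensor library.

```python
import math


def add_perturbation(audio, perturbation, eps=0.01):
    """Return audio + perturbation, sample by sample.

    Assumptions (not stated in the abstract): the perturbation is clipped to
    [-eps, eps] and the result is clipped to the waveform range [-1, 1].
    """
    adversarial = []
    for sample, delta in zip(audio, perturbation):
        delta = max(-eps, min(eps, delta))
        adversarial.append(max(-1.0, min(1.0, sample + delta)))
    return adversarial


def perceptual_loss(feats_clean, feats_adv, alpha=1.0, beta=1.0, tiny=1e-8):
    """Weighted sum of MSE and (1 - mean frame-wise cosine similarity).

    feats_clean, feats_adv: lists of frames, each frame a list of MFCC
    coefficients. alpha and beta are hypothetical weights; tiny guards
    against division by zero for all-zero frames.
    """
    squared_error = 0.0
    value_count = 0
    cosine_sum = 0.0
    for frame_c, frame_a in zip(feats_clean, feats_adv):
        squared_error += sum((c - a) ** 2 for c, a in zip(frame_c, frame_a))
        value_count += len(frame_c)
        dot = sum(c * a for c, a in zip(frame_c, frame_a))
        norm_c = math.sqrt(sum(c * c for c in frame_c))
        norm_a = math.sqrt(sum(a * a for a in frame_a))
        cosine_sum += dot / (norm_c * norm_a + tiny)
    mse = squared_error / value_count
    mean_cosine = cosine_sum / len(feats_clean)
    return alpha * mse + beta * (1.0 - mean_cosine)
```

Under these assumptions the loss is close to zero when the two feature sequences are identical and grows as the adversarial features drift from the clean ones in either magnitude (the MSE term) or per-frame direction (the cosine term).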
Li et al. studied this question.