As speech translation (ST) systems become increasingly prevalent, understanding their vulnerabilities is crucial for ensuring robust and reliable communication. However, limited work has examined this issue in depth. This paper investigates methods of compromising these systems through imperceptible audio manipulations. Specifically, we present two approaches: (1) adapting perturbation-based techniques used for automatic speech recognition (ASR) attacks to the ST context, making our work the first to apply this approach to ST, and (2) proposing a novel music generation-based method to guide targeted translation. We also conduct more practical over-the-air attacks in the physical world. Our experiments reveal that carefully crafted audio perturbations can mislead translation models into producing targeted, harmful outputs, while adversarial music achieves this goal more covertly by exploiting the natural imperceptibility of music. These attacks are effective across multiple languages and translation models, indicating a systemic vulnerability in current ST architectures. Beyond immediate security concerns, our findings point to broader challenges in the robustness and interpretability of neural speech systems. More details and samples can be found at https://adv-st.github.io.
Liu et al. (Thu,) studied this question.