High-quality synthetic speech produced by text-to-speech (TTS) systems is widely used in human-computer interaction and gives users a better experience. However, synthetic speech can be picked up by the microphone alongside real human speech, acting as a form of noise that degrades speech recognition performance. To address this issue, we propose several methods to study the adverse impact of synthetic speech on speech recognition and thereby improve its robustness. On the one hand, we adopt the concept of fake audio detection and add a module to the speech recognition model that differentiates between real and synthetic speech. On the other hand, we propose various ways of incorporating prompt labels, from a language-semantics perspective, to achieve the same differentiation. These prompt labels provide contextual cues that help the speech recognition model distinguish between the two types of speech. The experimental results demonstrate that the acoustic modeling of automatic speech recognition (ASR) can effectively distinguish real from synthetic speech. Placing the prompt labels at the beginning achieves the best performance in the clean synthetic data scenario, while emptying the transcripts of synthetic speech achieves the best performance in the noisy synthetic data scenario.
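The two transcript-side strategies named in the abstract (a prompt label at the beginning, and emptied transcripts for synthetic speech) can be illustrated with a minimal sketch of how training targets might be built. This is not the authors' implementation: the label strings `<real>` and `<synthetic>`, the `Utterance` type, and the `build_target` helper are all hypothetical, and the sketch assumes the label is prepended to the transcript text.

```python
from dataclasses import dataclass

# Hypothetical label tokens; the abstract does not give the actual strings.
REAL_TAG = "<real>"
SYN_TAG = "<synthetic>"


@dataclass
class Utterance:
    """One training example: reference text plus its (assumed known) origin."""
    transcript: str
    is_synthetic: bool


def build_target(utt: Utterance, strategy: str) -> str:
    """Return the training target text for one utterance.

    strategy:
      "prefix" - put the prompt label before the transcript
      "empty"  - keep real transcripts, blank out synthetic ones
    """
    if strategy == "prefix":
        tag = SYN_TAG if utt.is_synthetic else REAL_TAG
        return f"{tag} {utt.transcript}"
    if strategy == "empty":
        return "" if utt.is_synthetic else utt.transcript
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    real = Utterance("turn on the light", is_synthetic=False)
    fake = Utterance("turn on the light", is_synthetic=True)
    for strategy in ("prefix", "empty"):
        print(strategy, repr(build_target(real, strategy)),
              repr(build_target(fake, strategy)))
```

Under the "empty" strategy the model is trained to output nothing for synthetic input, whereas the "prefix" strategy keeps the words and only marks their origin. Which one fits depends on whether downstream components still need the synthetic speech transcribed.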
Jian Huang
Yancheng Bai
Alibaba Group (United States)
Yang Cai
Central South University
Alibaba Group (China)
Huang et al. studied this question.
synapsesocial.com/papers/68e73996b6db6435876b3765 — DOI: https://doi.org/10.1109/icassp48485.2024.10446991