March 18, 2024Open Access

A Study on the Adverse Impact of Synthetic Speech on Speech Recognition

JHJian HuangShandong Institute of Automation YBYancheng BaiAlibaba Group (United States)YCYang CaiCentral South University

Key Points

Key points are not available for this paper at this time.

Abstract

The high-quality synthetic speech by TTS has been widely used in the field of human-computer interaction, bringing users better experience. However, synthetic speech is prone to be mixed with real human speech as part of the noise and recorded by the microphone, which leads to performance decrease for speech recognition. To address this issue, we propose different methods to study the adverse impact of synthetic speech on speech recognition, thereby enhancing its robustness. On the one hand, we adopt the concept of fake audio detection and incorporate an additional module into speech recognition model to differentiate between real and synthetic speech. On the other hand, we propose various methods of incorporating prompt labels from a language semantics perspective to achieve differentiation. These prompt labels provide contextual cues that help speech recognition model to better understand the difference between the two types of speech. The experimental results demonstrate the acoustic modeling of ASR is capable of distinguishing between real and synthetic speech effectively. Putting the prompt labels at the beginning achieves the best performance in a clean synthetic data scenario, while emptying the transcripts of synthetic speech obtains the best performance in a noisy synthetic data scenario.

Ask AI

Helpful

Bookmark

View Full Paper