Keyword spotting (KS) and automatic speech recognition (ASR) on smart speakers in home environments remain challenging when loudspeaker signals interfere, despite improvements in acoustic echo cancellation (AEC) systems. In this work, we propose to combine a single-microphone AEC system, consisting of an adaptive linear filter (linear AEC) and a neural echo suppressor (NES), with an adaptive filter developed for multi-microphone noise reduction, called Cleaner. This additional enhancement step allows the AEC system to exploit spatial information to remove residual echo. The single-microphone NES model improves upon the waveform-domain counterpart proposed in [1] by using a frequency-domain representation that helps with generalization. Furthermore, we show that using multiple linear AEC configurations during model training provides large gains over a fixed configuration. On the hardest considered test condition, the proposed system outperforms the baseline model [1] for single-microphone input by 66 % (relative) in KS false reject rate (FRR) and 52 % (relative) in ASR word error rate (WER). Using the multi-microphone setting, the FRR is reduced by an additional 52 % and the WER by an additional 32 %.
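To make the "adaptive linear filter" stage concrete, the following is a minimal time-domain sketch of a linear echo canceller. The abstract does not state which adaptation rule the linear AEC uses, so the normalized least mean squares (NLMS) update shown here is an assumption, chosen only because it is a common textbook option. The function name, parameter names, and the synthetic three-tap echo path are all hypothetical, and the NES and Cleaner stages are not modeled.

```python
import numpy as np


def nlms_echo_canceller(mic, ref, num_taps=64, step_size=0.5, eps=1e-8):
    """Remove a linear echo of `ref` from `mic` with an NLMS adaptive filter.

    Illustrative only: NLMS is an assumed adaptation rule, not one
    confirmed by the abstract. Returns the residual signal and the
    final filter weights.
    """
    mic = np.asarray(mic, dtype=float)
    ref = np.asarray(ref, dtype=float)
    weights = np.zeros(num_taps)
    # Zero-pad the reference so the first samples have a full history window.
    padded = np.concatenate([np.zeros(num_taps - 1), ref])
    residual = np.zeros_like(mic)
    for n in range(len(mic)):
        # Newest reference sample first: ref[n], ref[n-1], ...
        window = padded[n:n + num_taps][::-1]
        echo_estimate = weights @ window
        residual[n] = mic[n] - echo_estimate
        # Gradient step normalized by the energy of the current window.
        weights += step_size * residual[n] * window / (window @ window + eps)
    return residual, weights


# Synthetic check: the microphone picks up only an echo of the loudspeaker
# signal through a hypothetical three-tap echo path (no near-end speech).
rng = np.random.default_rng(0)
ref = rng.standard_normal(4000)
echo_path = np.array([0.6, -0.3, 0.1])
mic = np.convolve(ref, echo_path)[:len(ref)]

residual, weights = nlms_echo_canceller(mic, ref, num_taps=8)
```

In this idealized setup the filter can model the echo path exactly, so the residual decays toward zero. In a real room, nonlinear loudspeaker distortion, longer echo tails, and near-end speech leave residual echo, which is the part the abstract assigns to the NES and, in the multi-microphone case, to Cleaner.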
Heitkaemper et al. studied this question.