Abstract In deep-learning-based speech enhancement, encoder-decoder structures are widely used to suppress noise and restore speech. Models that perform well often have large parameter counts, which makes them poorly suited to edge computing chips. In this study, we propose a channel-grouped iterative temporal-frequency convolution convolutional recurrent network with only 15.8 K parameters, which can be easily deployed on headphones. The encoder-decoder structure uses an improved four-layer block iterative temporal-frequency convolution module. Deep convolutional networks often contain redundancy, which channel-grouped processing can effectively reduce. To make full use of all channel information, we apply channel-shift iterative processing, so that every channel has been processed after passing through the multi-layer time-frequency convolution module. Within the time-frequency convolution module, sub-band feature extraction and multi-scale dilated convolution enhance frequency-domain perception, and a recurrent neural network (RNN) is introduced to enhance time-domain modeling. Experimental results on the VCTK and DEMAND datasets show that our model, despite its extremely low parameter count, matches or even exceeds conventional methods on multiple evaluation metrics. Specifically, it achieves a PESQ score of 2.70 using GRU and 2.75 using CFC, with an SI-SNR of 8.22 dB, reflecting improved speech quality. The algorithm is deployed on a headphone edge computing chip with only 0.1 TOPS of computing power, where it processes audio signals with a delay of 33 ms. Joint processing of the left and right channels and adaptive training methods further improve performance.
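The channel-grouped, channel-shift idea described in the abstract can be illustrated with a small sketch. The abstract does not give the exact scheme, so the following rests on assumptions: each layer processes only one group of channels, the channels are then rotated by one group so that the next layer sees a different group, and the number of layers equals the number of groups. The names `grouped_shift_process` and `make_marker_block` are hypothetical, and the marker block is only a stand-in for the time-frequency convolution module.

```python
from typing import Callable, List

Channel = List[float]  # one channel's features, flattened for illustration
Block = Callable[[List[Channel]], List[Channel]]


def grouped_shift_process(
    channels: List[Channel], num_groups: int, blocks: List[Block]
) -> List[Channel]:
    """Run each block on one channel group, then rotate the channels.

    Assumed scheme (not taken from the paper): the leading group is
    processed, the remaining channels pass through unchanged, and the
    channel list is rotated by one group before the next block.
    """
    if len(channels) % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    group_size = len(channels) // num_groups
    out = [list(ch) for ch in channels]
    for block in blocks:
        # Only the leading group is processed by this block.
        processed = block(out[:group_size])
        out = processed + out[group_size:]
        # Channel shift: bring the next group to the front.
        out = out[group_size:] + out[:group_size]
    return out


def make_marker_block(increment: float) -> Block:
    """Hypothetical stand-in for a time-frequency convolution block."""
    def block(group: List[Channel]) -> List[Channel]:
        return [[x + increment for x in ch] for ch in group]
    return block


if __name__ == "__main__":
    channels = [[0.0, 0.0] for _ in range(8)]
    blocks = [make_marker_block(1.0) for _ in range(4)]
    # With four groups and four blocks, every channel is touched once.
    print(grouped_shift_process(channels, num_groups=4, blocks=blocks))
```

Under these assumptions each block operates on a quarter of the channels, which is where the parameter savings would come from, while the rotation ensures that no channel is left unprocessed after the fourth block.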
Zhao et al. (Mon,) studied this question.