What question did this study set out to answer?

The study aims to design a low-parameter model for effective noise reduction and speech enhancement in edge computing devices.

March 18, 2026 | Open Access

A low parameter channel grouped iterative convolutional recurrent network for speech enhancement of noise-reducing headphones

Key Points

The study aims to design a low-parameter model for effective noise reduction and speech enhancement in edge computing devices.
Developed a channel grouped iterative convolutional recurrent network with 15.8 K parameters.
Utilized an improved four-layer block iterative time-frequency convolution module.
Employed channel shift iterative processing for optimal channel information processing.
Incorporated sub-band feature extraction and multi-scale dilated convolution.
Implemented an RNN to enhance time-domain modeling.
Achieved a PESQ score of 2.70 using GRU and 2.75 using CFC.
Demonstrated an 8.22 dB SISNR, indicating improved speech quality.
Utilized only 0.1 TOPS of computing power in headphones with a 33 ms audio processing delay.
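
For readers unfamiliar with the SISNR figure quoted above, the following is a minimal sketch of the commonly used scale-invariant signal-to-noise ratio definition. The study's own evaluation code is not shown on this page, so the function name `si_snr`, the mean-removal step, and the `eps` stabilizer are assumptions made for illustration.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample sequences.

    Illustrative sketch of the common definition, not the paper's code.
    """
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    n = len(target)

    # Remove the mean from both signals (assumed preprocessing step).
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]

    # Project the estimate onto the target to get the "clean" component.
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t) + eps
    scale = dot / t_energy
    s_target = [scale * b for b in t]

    # Whatever is left after the projection is treated as noise.
    e_noise = [a - s for a, s in zip(e, s_target)]

    signal_power = sum(s * s for s in s_target)
    noise_power = sum(x * x for x in e_noise) + eps
    return 10.0 * math.log10(signal_power / noise_power + eps)


# Toy example: a clean square-like signal plus a small error term.
clean = [1.0, -1.0, 1.0, -1.0]
noisy = [1.1, -0.9, 0.9, -1.1]
score = si_snr(noisy, clean)
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate by a constant gain leaves the score unchanged, which is why this metric is preferred over plain SNR when the output level of an enhancement model is arbitrary.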

Abstract

In deep learning-based speech enhancement, encoder-decoder structures are commonly used to suppress noise and restore speech. Models that perform well often have large parameter counts, which makes them poorly suited to edge computing chips. In this study, we propose a channel grouped iterative time-frequency convolution convolutional recurrent network with only 15.8 K parameters, which can be easily deployed on headphones. The encoder-decoder structure uses an improved four-layer block iterative time-frequency convolution module. Deep convolutional networks often contain redundancy, which channel grouped processing can effectively reduce. To make full use of all channel information, channel shift iterative processing is applied, so that every channel has been processed after passing through the multi-layer time-frequency convolution module. Within the time-frequency convolution module, sub-band feature extraction and multi-scale dilated convolution enhance frequency-domain perception, and an RNN is introduced to enhance time-domain modeling. Experimental results on the VCTK and DEMAND datasets show that our model, despite its extremely low parameter count, matches or even exceeds conventional methods on multiple evaluation metrics. Specifically, it achieves a PESQ score of 2.70 using GRU and 2.75 using CFC, with 8.22 dB SISNR, reflecting improved speech quality. The algorithm is deployed on an edge computing chip used in headphones, with only 0.1 TOPS of computing power, and processes audio signals with a 33 ms delay. Better performance is achieved through joint processing of the left and right channels and adaptive training methods.
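
The abstract does not spell out the exact channel shift scheme, so the following is only a hypothetical sketch of the general idea: each layer processes one group of channels, and the group advances by one position per layer so that every channel is covered after as many layers as there are groups. The function names, the round-robin shift rule, and all sizes are assumptions made for illustration, not details taken from the paper.

```python
def conv_weights(in_channels, out_channels, kernel_size):
    """Weight count of a 1-D convolution, ignoring biases."""
    return in_channels * out_channels * kernel_size


def shifted_group(num_channels, num_groups, layer_index):
    """Channel indices handled at one layer; the group advances each layer."""
    if num_channels % num_groups != 0:
        raise ValueError("num_channels must be divisible by num_groups")
    group_size = num_channels // num_groups
    start = (layer_index % num_groups) * group_size
    return list(range(start, start + group_size))


def run_layers(features, num_groups, num_layers, process):
    """Apply `process` to one channel group per layer; pass the rest through."""
    channels = [list(ch) for ch in features]
    for layer in range(num_layers):
        for c in shifted_group(len(channels), num_groups, layer):
            channels[c] = process(channels[c])
    return channels


# Hypothetical sizes chosen only for illustration (not taken from the paper).
full = conv_weights(16, 16, 3)     # convolving all 16 channels at once
grouped = conv_weights(4, 4, 3)    # convolving a single group of 4 channels

# Eight one-sample channels, four groups, four layers: each channel is
# processed exactly once by the stand-in `process` function (a doubling).
features = [[1.0] for _ in range(8)]
doubled = run_layers(features, num_groups=4, num_layers=4,
                     process=lambda ch: [2.0 * x for x in ch])
```

With these illustrative sizes, the single-group convolution needs 48 weights against 768 for the full-width one, which shows why grouping reduces parameters, while the shifting group index is what lets all channels be touched over successive layers.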

Cite This Study

Zhao et al. studied this question.

synapsesocial.com/papers/69ba424e4e9516ffd37a2663 https://doi.org/10.1186/s13636-026-00455-4
