Audio-visual speech enhancement (AVSE) uses visual information to assist noise reduction when enhancing speech in multimodal scenes. Lip movements play an important role in speech perception, especially at low signal-to-noise ratios, and this motivates the design of more effective AVSE models. In this paper, we propose an AVSE model that assists speech enhancement by extracting visual features. The network consists of three main parts. First, ResNet18, feature pyramid network (FPN), and coordinate attention (CA) modules are combined to extract multi-scale visual features. Second, a dual-branch structure combining dilated convolution and cascaded convolution extracts multi-scale speech features, and a temporal convolutional network (TCN) module models the temporal dependencies. Finally, for the fused audio-visual features, a parallel conformer module extracts time-domain and frequency-domain features to better aggregate the global and local information of the sequence. Experiments on the GRID audio-visual dataset show that the model outperforms common single-channel speech enhancement models, and ablation tests demonstrate the effectiveness of each module.
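To illustrate why a dual-branch design can capture speech features at multiple scales, the following minimal sketch compares a single dilated (also called atrous or cavity) convolution with a cascade of two ordinary convolutions in one dimension. It is not the authors' implementation: the function names `dilated_conv1d` and `receptive_field`, the kernel values, and the toy signal are all assumptions chosen for illustration, and only the Python standard library is used.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid-mode 1-D dilated convolution (cross-correlation, as in deep learning)."""
    # Number of input samples covered by one output sample.
    span = (len(kernel) - 1) * dilation + 1
    out = []
    for start in range(len(signal) - span + 1):
        # Kernel taps are spaced `dilation` samples apart.
        out.append(sum(w * signal[start + k * dilation] for k, w in enumerate(kernel)))
    return out


def receptive_field(kernel_sizes, dilations):
    """Receptive field of a stack of stride-1 convolution layers."""
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d
    return rf


# Hypothetical toy input: a linear ramp of 10 samples.
signal = [float(i) for i in range(10)]
kernel = [1.0, 0.0, -1.0]  # assumed difference-style kernel

# Branch A: one dilated layer (kernel size 3, dilation 2).
dilated = dilated_conv1d(signal, kernel, dilation=2)

# Branch B: two cascaded ordinary layers (kernel size 3, dilation 1 each).
cascaded = dilated_conv1d(dilated_conv1d(signal, kernel), kernel)

print(receptive_field([3], [2]))        # receptive field of branch A
print(receptive_field([3, 3], [1, 1]))  # receptive field of branch B
```

Under these assumptions both branches cover five input samples per output sample, but the dilated branch does so with one sparse layer while the cascaded branch applies two dense layers, so the two branches respond differently to the same context. A real model would use learned multi-channel kernels rather than the fixed kernel shown here.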
Jia et al. studied this question.