We introduce a new deep learning model for talker-independent audiovisual speaker separation in noisy conditions, operating in the time-frequency domain. The inputs to the model are noisy multi-talker mixtures and the corresponding cropped face images. Our approach uses cross-attention audiovisual fusion, which merges audio and visual features and lets information flow between the auditory and visual modalities. The fused features drive a separator module, which separates the acoustic features of the individual speakers. The separator module is based on the recently proposed TF-GridNet and comprises an intra-frame full-band component, a sub-band temporal module that captures frequency-specific temporal dependencies, and a cross-attention module dedicated to extracting long-term fused audiovisual features. To encourage the use of the visual streams during training, we employ a signal-to-noise ratio (SNR) scheduler. Experimental results demonstrate that the proposed model advances state-of-the-art speaker separation performance on several audiovisual benchmark datasets.
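As a rough illustration of the cross-attention fusion idea mentioned above, the following minimal sketch lets each audio frame attend over the visual frames using scaled dot-product attention. It is not the authors' implementation: the abstract does not specify the layer design, so the function names (`cross_attention`, `fuse_audio_visual`), the use of audio frames as queries and visual frames as keys and values, the residual addition, and the toy feature values are all assumptions made for illustration. Learned projections, multiple heads, and normalization are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention (no learned projections).

    Each query attends over all keys; the output for a query is the
    attention-weighted sum of the values. Returns (outputs, weights).
    """
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        w = softmax([dot(q, k) / scale for k in keys])
        out = [
            sum(w_j * v[i] for w_j, v in zip(w, values))
            for i in range(len(values[0]))
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


def fuse_audio_visual(audio, visual):
    """Hypothetical fusion step: audio frames query the visual frames,
    and the attended visual context is added back to the audio features
    (the residual connection is an assumption)."""
    attended, weights = cross_attention(audio, visual, visual)
    fused = [
        [a + c for a, c in zip(a_frame, c_frame)]
        for a_frame, c_frame in zip(audio, attended)
    ]
    return fused, weights


# Toy features: 4 audio frames and 2 video frames, both of dimension 3.
audio = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
visual = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
fused, weights = fuse_audio_visual(audio, visual)
```

Because the attention weights are computed per audio frame, the two streams do not need the same frame rate, which is one reason cross-attention is a convenient choice for combining audio features with lower-rate video features.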
Kalkhorani et al. studied this question.