We introduce a new deep learning model for talker-independent audiovisual speaker separation in noisy conditions, operating in the time-frequency domain. The inputs to the model are noisy multi-talker mixtures and the corresponding cropped face images. Our approach uses cross-attention audiovisual fusion, which merges audio and visual features and lets information flow between the auditory and visual modalities. The fused features drive a separator module, which separates the acoustic features of the individual speakers. The separator module is based on the recently proposed TF-GridNet and comprises an intra-frame full-band component, a sub-band temporal module that captures frequency-specific temporal dependencies, and a cross-attention module dedicated to extracting long-term fused audiovisual features. To encourage the use of the visual streams during training, we employ a signal-to-noise ratio (SNR) scheduler. Experimental results demonstrate that the proposed model advances state-of-the-art speaker separation performance on several audiovisual benchmark datasets.
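The abstract does not say how the SNR scheduler works, so the following is only a minimal sketch of one way such a scheduler could be wired into training-data mixing. The function names (`scheduled_snr_db`, `mix_at_snr`), the linear shape of the schedule, its direction, and the endpoint values of -5 dB and 10 dB are all assumptions made for illustration and are not taken from the paper. The one part that is standard is the gain formula, which scales the noise so that the speech-to-noise power ratio equals the requested SNR.

```python
import math


def scheduled_snr_db(step, total_steps, start_db=-5.0, end_db=10.0):
    """Hypothetical linear SNR schedule (shape and endpoints are assumptions)."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start_db + frac * (end_db - start_db)


def power(signal):
    """Mean power of a sequence of samples."""
    return sum(v * v for v in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that 10*log10(P_speech / P_noise) equals snr_db.

    Returns the mixture and the gain that was applied to the noise.
    """
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    mixture = [s + gain * n for s, n in zip(speech, noise)]
    return mixture, gain


if __name__ == "__main__":
    # Toy signals standing in for a speech mixture and a noise recording.
    speech = [1.0, -1.0, 1.0, -1.0]
    noise = [0.5, 0.5, -0.5, -0.5]
    for step in (0, 50, 100):
        snr = scheduled_snr_db(step, total_steps=100)
        mixture, gain = mix_at_snr(speech, noise, snr)
        print(step, snr, gain)
```

In a training loop, the current step would select the SNR for each generated mixture. Whether the actual scheduler moves from low to high SNR, or the reverse, is not stated in the abstract.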
The Ohio State University
META Health
Kalkhorani et al. (Mon,) studied this question.