March 18, 2024 | Open Access

Audiovisual Speaker Separation with Full- and Sub-Band Modeling in the Time-Frequency Domain

Abstract

We introduce a new deep learning model for talker-independent audiovisual speaker separation in noisy conditions, operating in the time-frequency domain. The inputs to the model are noisy multi-talker mixtures and the corresponding cropped face images. Our approach uses cross-attention audiovisual fusion, which merges audio and visual features and allows information to flow between the auditory and visual modalities. The fused features drive a separator module, which separates the acoustic features of the individual speakers. The separator module is based on the recently proposed TF-GridNet and comprises an intra-frame full-band component, a sub-band temporal module that captures frequency-specific temporal dependencies, and a cross-attention module dedicated to extracting long-term fused audiovisual features. To encourage the use of the visual streams during training, we employ a Signal-to-Noise Ratio (SNR) scheduler. Experimental results demonstrate that the proposed model advances state-of-the-art speaker separation performance on several audiovisual benchmark datasets.
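To make the cross-attention fusion idea concrete, the following is a minimal, dependency-free sketch of single-head scaled dot-product cross-attention between an audio feature sequence and a visual feature sequence. It is not the authors' implementation: the abstract does not specify which modality supplies the queries, the feature dimensions, the number of heads, or whether a residual connection is used. Here we assume audio frames act as queries, visual frames act as keys and values, and the attended visual features are added back to the audio features. All names (`cross_attention_fuse`, `w_q`, `w_k`, `w_v`) and all sizes are hypothetical, and the projection matrices are random rather than learned.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(audio, visual, w_q, w_k, w_v):
    """Fuse audio (T_a x D) with visual (T_v x D) features.

    Assumption: audio frames are the queries, visual frames are the
    keys and values, and a residual connection keeps the audio features.
    Returns the fused features (T_a x D) and the attention weights (T_a x T_v).
    """
    q = matmul(audio, w_q)
    k = matmul(visual, w_k)
    v = matmul(visual, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    # Each audio frame gets a distribution over the visual frames.
    weights = [softmax([s / scale for s in row]) for row in scores]
    attended = matmul(weights, v)
    fused = [
        [a + c for a, c in zip(a_row, c_row)]
        for a_row, c_row in zip(audio, attended)
    ]
    return fused, weights


def random_matrix(rows, cols, rng, scale=1.0):
    """Gaussian random matrix standing in for learned features or weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: 8 audio frames, 2 video frames, 4-dimensional features.
rng = random.Random(0)
T_A, T_V, D = 8, 2, 4
audio = random_matrix(T_A, D, rng)
visual = random_matrix(T_V, D, rng)
w_q, w_k, w_v = (random_matrix(D, D, rng, 1.0 / math.sqrt(D)) for _ in range(3))

fused, weights = cross_attention_fuse(audio, visual, w_q, w_k, w_v)
```

In this sketch the fused output keeps the audio frame rate, which is one plausible way to hand fused features to a separator that operates on audio frames; the video stream typically has far fewer frames per second than the audio features, and the attention weights handle that mismatch without explicit upsampling.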

Institutions

The Ohio State University

META Health
