Accurate estimation of the target speech mask is key to single-channel speech separation. Masks generated by conventional mask networks are easily corrupted by interfering speech and background noise, which degrades separation performance. To address this problem, this paper proposes a Cross-Purification Mask Network (CPMN) consisting of three core modules: the Dynamic Context-Aware Mechanism (DCAM), the Feature Cross-Complementation Mechanism (FCCM), and the Adaptive Purification Mask Mechanism (APMM). The DCAM aggregates dynamic sliding-window features and long-term temporal features to capture long-range temporal dependencies in the masks and to localize the target speech more accurately. The FCCM fuses weighted mask features of the interfering speakers to dynamically supplement information missing from the target speech mask. The APMM combines adaptive filters with residual networks to output refined, high-precision masks. The CPMN is embedded into three mainstream speech separation frameworks (Conv-TasNet, DPTNet, and TDANet) and evaluated extensively on the Libri2Mix, WHAM!, and WSJ0-2Mix datasets. The results show that the CPMN yields consistent performance gains. With the CPMN integrated, TDANet achieves an SI-SNRi of 17.4 dB (+0.5 dB) on Libri2Mix and 15.2 dB (+0.4 dB) on WHAM!, while Conv-TasNet and DPTNet gain 0.3 dB (reaching 15.6 dB) and 0.4 dB (reaching 20.8 dB) in SI-SNR on WSJ0-2Mix, respectively.
Zhu et al. studied this question.
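The results above are reported as SI-SNRi, the improvement in scale-invariant signal-to-noise ratio of the separated signal over the unprocessed mixture. As a reading aid, the following is a minimal sketch of the standard SI-SNR and SI-SNRi definitions in plain Python. It is not taken from the paper: the function names (`si_snr`, `si_snr_improvement`), the `eps` stabilizer, and the toy sinusoid signals are assumptions chosen for illustration, and the numbers it produces are unrelated to the reported results.

```python
import math


def _zero_mean(x):
    """Remove the mean so the metric ignores any DC offset."""
    m = sum(x) / len(x)
    return [v - m for v in x]


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR (dB) of `estimate` with respect to `reference`.

    `eps` is an assumed small constant that avoids division by zero.
    """
    est = _zero_mean(estimate)
    ref = _zero_mean(reference)
    # Project the estimate onto the reference to get the target component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]
    # Everything left after the projection counts as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: SI-SNR of the estimate minus SI-SNR of the raw mixture."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)


# Toy signals (illustrative only): two sinusoids standing in for two speakers.
n = 1000
target_speech = [math.sin(2 * math.pi * 5 * k / n) for k in range(n)]
interferer = [math.sin(2 * math.pi * 13 * k / n) for k in range(n)]
mixture = [s + i for s, i in zip(target_speech, interferer)]
# A hypothetical separator output that attenuates the interferer tenfold.
estimate = [s + 0.1 * i for s, i in zip(target_speech, interferer)]

gain_db = si_snr_improvement(estimate, mixture, target_speech)
print(f"SI-SNRi on the toy example: {gain_db:.2f} dB")
```

In this toy case the interferer amplitude drops by a factor of ten, so the improvement comes out close to 20 dB. Because the target component is obtained by projection, rescaling the estimate leaves the score unchanged, which is why the metric is called scale-invariant.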