What does this research mean for the field?

When integrated into mainstream separation frameworks, the Cross-Purification Mask Network (CPMN) improves single-channel speech separation by refining target speech masks that would otherwise be corrupted by interfering speech and background noise. Novelty: methodological. Consensus alignment: neutral.

What question did this study set out to answer?

This work aims to improve the accuracy of target speech mask estimation for efficient single-channel speech separation.

May 20, 2026 | Open Access

Cross-Purification Mask Network: A Mask Refinement Method for Single-Channel Speech Separation

Key Points

This work aims to improve the accuracy of target speech mask estimation for efficient single-channel speech separation.
Developed the Cross-Purification Mask Network with three key modules: DCAM, FCCM, and APMM.
Conducted experiments on the Libri2Mix, WHAM!, and WSJ0-2Mix datasets.
Integrated CPMN into three existing frameworks: Conv-TasNet, DPTNet, and TDANet.
With CPMN, TDANet achieved an SI-SNRi of 17.4 dB (+0.5 dB) on Libri2Mix and 15.2 dB (+0.4 dB) on WHAM!.
With CPMN, Conv-TasNet improved by 0.3 dB, reaching an SI-SNR of 15.6 dB on WSJ0-2Mix.
With CPMN, DPTNet improved by 0.4 dB, reaching an SI-SNR of 20.8 dB on WSJ0-2Mix.
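
The results above are reported in SI-SNR and SI-SNRi (scale-invariant signal-to-noise ratio and its improvement over the unprocessed mixture). The following is a minimal NumPy sketch of the standard definition of these metrics, not the evaluation code used in the study. The function names `si_snr` and `si_snr_improvement`, the `eps` constant, and the synthetic signals are assumptions made for illustration.

```python
import numpy as np


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two 1-D signals of equal length."""
    # Remove the mean so that a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to get the "clean" component.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    s_target = scale * target
    # Whatever is left after the projection is treated as noise.
    e_noise = estimate - s_target
    ratio = (np.dot(s_target, s_target) + eps) / (np.dot(e_noise, e_noise) + eps)
    return 10.0 * np.log10(ratio)


def si_snr_improvement(estimate, target, mixture):
    """SI-SNRi: gain of the estimate over the unprocessed mixture, in dB."""
    return si_snr(estimate, target) - si_snr(mixture, target)


# Synthetic example (illustrative only): a tone plus an interfering noise source.
rng = np.random.default_rng(0)
time = np.arange(16000) / 16000.0
target = np.sin(2.0 * np.pi * 440.0 * time)
interferer = rng.standard_normal(16000)
mixture = target + interferer
# Hypothetical separator output that suppresses the interferer by a factor of 10.
estimate = target + 0.1 * interferer

print(f"SI-SNR of mixture:  {si_snr(mixture, target):.2f} dB")
print(f"SI-SNR of estimate: {si_snr(estimate, target):.2f} dB")
print(f"SI-SNRi:            {si_snr_improvement(estimate, target, mixture):.2f} dB")
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is why the metric is called scale-invariant.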

Abstract

Accurate target speech mask estimation is the key to single-channel speech separation. Masks generated by conventional mask networks are easily corrupted by interfering speech and background noise, which degrades separation performance. To address this problem, this paper proposes a Cross-Purification Mask Network (CPMN), which consists of three core modules: the Dynamic Context-Aware Mechanism (DCAM), the Feature Cross-Complementation Mechanism (FCCM), and the Adaptive Purification Mask Mechanism (APMM). The DCAM aggregates dynamic sliding-window features and long-term temporal features to capture long-range temporal dependencies of masks and to localize the target speech more accurately. The FCCM fuses weighted mask features of interfering speakers to dynamically supplement information missing from the target speech masks. The APMM combines adaptive filters and residual networks to output high-precision refined masks. The CPMN is embedded into three mainstream speech separation frameworks (Conv-TasNet, DPTNet, and TDANet), and extensive experiments are conducted on the Libri2Mix, WHAM!, and WSJ0-2Mix datasets. The results show that the CPMN brings stable performance gains. After integration, TDANet achieves an SI-SNRi of 17.4 dB (+0.5 dB) on Libri2Mix and 15.2 dB (+0.4 dB) on WHAM!. Meanwhile, Conv-TasNet and DPTNet obtain SI-SNR improvements of 0.3 dB (15.6 dB) and 0.4 dB (20.8 dB) on WSJ0-2Mix, respectively.
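
To make the idea of mask refinement concrete, the following toy NumPy sketch shows where a refinement step sits in a mask-based separator: coarse per-speaker masks are adjusted by a residual correction and then applied to the mixture features. It is not the CPMN and does not implement the DCAM, FCCM, or APMM. The functions `refine_masks` and `apply_masks`, the `cross_weight` parameter, the tensor shapes, and the correction rule (nudging the masks toward summing to one across speakers) are all assumptions chosen only for illustration.

```python
import numpy as np


def apply_masks(mixture_features, masks):
    """Apply per-speaker masks to mixture features.

    mixture_features: (channels, frames)
    masks:            (speakers, channels, frames), values in [0, 1]
    returns:          (speakers, channels, frames)
    """
    return masks * mixture_features[np.newaxis, :, :]


def refine_masks(masks, cross_weight=0.1):
    """Hypothetical residual refinement, NOT the paper's method.

    The correction for every speaker depends on the sum of all speakers'
    masks, so each mask is adjusted using information from the others.
    """
    total = masks.sum(axis=0, keepdims=True)    # (1, channels, frames)
    correction = cross_weight * (1.0 - total)   # energy left unclaimed or over-claimed
    # Residual form: refined = coarse + correction, kept in the valid mask range.
    return np.clip(masks + correction, 0.0, 1.0)


# Random stand-ins for an encoder output and a coarse mask estimate.
rng = np.random.default_rng(0)
mixture_features = rng.standard_normal((4, 6))
coarse_masks = 1.0 / (1.0 + np.exp(-rng.standard_normal((2, 4, 6))))

refined_masks = refine_masks(coarse_masks)
separated = apply_masks(mixture_features, refined_masks)
print(separated.shape)
```

In a real system the correction would be produced by learned modules rather than a fixed rule. The sketch only illustrates the residual structure: the refined mask is the coarse mask plus a correction term.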
