Monaural speech enhancement (SE) is a technique for extracting a clean speech signal from a monaural noisy speech signal. The mainstream approach, supervised learning, requires supervised data, i.e., pairs of clean and noisy speech data. Such data are expensive to obtain, however, because recording clean speech requires a quiet environment such as a studio. This paper proposes an SE method based on positive-negative-unlabeled (PNU) learning, a semi-supervised learning method. To achieve high SE performance even with limited supervised data, the proposed method also leverages unsupervised data, i.e., noisy speech data alone, which can be collected easily, e.g., from smart speakers or the Web. In our method, a deep neural network predicts a binary mask for SE by classifying each time-frequency bin as speech-dominant (positive, P) or noise-dominant (negative, N). The network is trained through PNU learning using P and N data obtained from the supervised data and unlabeled (U) data obtained from the unsupervised data. An experiment confirmed that increasing the amount of U data improves the SE performance of the proposed method and enables it to outperform supervised learning.
Ogawa et al. studied this question.
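The two ingredients named in the abstract, binary-mask labeling of time-frequency bins and a PNU training objective, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the toy values, the "speech power greater than noise power" labeling rule, the sigmoid loss, the class prior `prior_p`, and the trade-off parameter `eta` do not come from the abstract. The risk follows the general PNU formulation from the semi-supervised classification literature (a PN risk blended with a PU or NU risk) and may differ from the objective the paper actually uses.

```python
import math


def binary_mask(speech_power, noise_power):
    """Label each time-frequency bin: 1 = speech-dominant (P), 0 = noise-dominant (N).

    Assumed rule: a bin is speech-dominant when its speech power exceeds its noise power.
    """
    return [
        [1 if s > n else 0 for s, n in zip(s_row, n_row)]
        for s_row, n_row in zip(speech_power, noise_power)
    ]


def apply_mask(mask, noisy_spec):
    """Keep the noisy magnitude in speech-dominant bins and zero out the rest."""
    return [
        [m * x for m, x in zip(m_row, x_row)]
        for m_row, x_row in zip(mask, noisy_spec)
    ]


def sigmoid_loss(margin):
    """Loss on the margin y * g(x); small when the score agrees with the label."""
    return 1.0 / (1.0 + math.exp(margin))


def mean_loss(scores, label):
    """Average loss when every score in `scores` is treated as having `label` (+1 or -1)."""
    return sum(sigmoid_loss(label * g) for g in scores) / len(scores)


def pn_risk(scores_p, scores_n, prior_p):
    """Ordinary supervised (positive-negative) risk."""
    return prior_p * mean_loss(scores_p, +1) + (1.0 - prior_p) * mean_loss(scores_n, -1)


def pu_risk(scores_p, scores_u, prior_p):
    """Positive-unlabeled risk: U data stand in for the missing negative term."""
    return (prior_p * mean_loss(scores_p, +1)
            - prior_p * mean_loss(scores_p, -1)
            + mean_loss(scores_u, -1))


def nu_risk(scores_n, scores_u, prior_p):
    """Negative-unlabeled risk: U data stand in for the missing positive term."""
    prior_n = 1.0 - prior_p
    return (prior_n * mean_loss(scores_n, -1)
            - prior_n * mean_loss(scores_n, +1)
            + mean_loss(scores_u, +1))


def pnu_risk(scores_p, scores_n, scores_u, prior_p, eta):
    """Blend the PN risk with the PU risk (eta >= 0) or the NU risk (eta < 0); eta in [-1, 1]."""
    pn = pn_risk(scores_p, scores_n, prior_p)
    if eta >= 0:
        return (1.0 - eta) * pn + eta * pu_risk(scores_p, scores_u, prior_p)
    return (1.0 + eta) * pn - eta * nu_risk(scores_n, scores_u, prior_p)


# Toy example: 2 frames x 3 frequency bins (all values are made up).
speech_power = [[4.0, 0.1, 2.0], [0.0, 3.0, 0.5]]
noise_power = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
noisy_spec = [[5.0, 1.1, 3.0], [1.0, 4.0, 1.5]]

mask = binary_mask(speech_power, noise_power)
enhanced = apply_mask(mask, noisy_spec)

# Hypothetical network scores g(x) for P, N, and U bins.
scores_p, scores_n, scores_u = [2.0, 1.0], [-2.0, -0.5], [0.3, -1.2, 0.8]
risk = pnu_risk(scores_p, scores_n, scores_u, prior_p=0.5, eta=0.3)
```

Setting `eta = 0` reduces the sketch to purely supervised training, while moving `eta` toward 1 or -1 gives more weight to the unlabeled bins. In practice the scores would come from the network's output for each bin, and the class prior would have to be estimated or chosen.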