In recent years, deep-learning methods for speech enhancement have become increasingly prevalent. Recent research has focused on efficiently capturing the long-range dependencies of speech to improve performance. However, many speech enhancement studies overlook the distribution of speech signal energy in the time-frequency (T-F) domain, a crucial factor for accurately predicting masks. This paper presents a target speech enhancement model based on long-term dependencies and T-F distributions. The system consists of a feature extractor and a spectrum reconstruction module. Traditional CNNs have a significant limitation: they use a fixed-size kernel, which compromises both local and contextual information. The extractor overcomes this with multiscale convolution-based feature extractor blocks (MSCFEBs), which use different kernel sizes in a single layer to capture the interaction of local and contextual information in the signal. A time-frequency attention (TFA) module follows each MSCFEB in the feature extractor. TFA allocates varying attention weights to each time-frequency spectral component, allowing the model to focus on specific time frames and frequency channels. By stacking several filtering modules (FMs), the spectrum reconstruction module (SRM) reconstructs the spectrum progressively to reduce noise in the magnitude spectrum. Spectrum estimation is ultimately achieved by unfolding the filtering modules repeatedly, gradually improving the intermediate outcome across stages. The proposed approach is validated on the Common Voice dataset. The results show that the model outperforms the baseline systems.
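The two feature-extractor ideas in the abstract, several kernel sizes in one layer followed by time-frequency attention, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names `multiscale_features` and `tf_attention`, the kernel sizes, the fixed averaging kernels, and the pooling-plus-sigmoid weighting are all assumptions chosen for illustration; the paper's blocks use learned convolution and attention weights.

```python
import numpy as np


def multiscale_features(spec, kernel_sizes=(3, 5, 7)):
    """Filter a (freq, time) magnitude spectrogram along time at several scales.

    Each branch uses a fixed averaging kernel as a stand-in for learned
    convolution weights (assumption). Returns shape (n_scales, freq, time).
    """
    branches = []
    for k in kernel_sizes:
        kernel = np.ones(k) / k
        # Convolve every frequency row along the time axis, keeping its length.
        out = np.apply_along_axis(
            lambda row: np.convolve(row, kernel, mode="same"), 1, spec
        )
        branches.append(out)
    return np.stack(branches, axis=0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def tf_attention(feats):
    """Reweight (channels, freq, time) features per time frame and frequency bin.

    Hypothetical, parameter-free attention: descriptors are obtained by average
    pooling and squashed with a sigmoid, with no learned layers.
    """
    time_desc = feats.mean(axis=1, keepdims=True)  # (C, 1, T): pooled over frequency
    freq_desc = feats.mean(axis=2, keepdims=True)  # (C, F, 1): pooled over time
    time_w = sigmoid(time_desc - time_desc.mean())
    freq_w = sigmoid(freq_desc - freq_desc.mean())
    weights = time_w * freq_w  # broadcasts to (C, F, T)
    return feats * weights, weights


# Toy input: 8 frequency bins by 20 time frames of non-negative magnitudes.
rng = np.random.default_rng(0)
spec = np.abs(rng.standard_normal((8, 20)))
feats = multiscale_features(spec)
attended, weights = tf_attention(feats)
```

The small kernel keeps local detail while the larger ones average over more context, and stacking the branches lets later stages see both. The attention weights lie strictly between 0 and 1, so each time-frequency component is attenuated by a different amount rather than passed through unchanged.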
Parisae et al. studied this question.