What type of study is this?

This is a quantitative study.

October 5, 2025 · Open Access

Development of a Robust Foreground Speech Enhancement Module for Sub-Optimal Data

Key Points

The proposed model achieves an STOI score of 0.88 and an SNR improvement of 3.27 dB, enhancing speech quality.
By using a statistics pooling layer with a pre-trained wav2vec2 network, the model handles variable-length inputs effectively.
Log mel-spectrograms contribute to the model's robustness, allowing superior performance even with restricted training data.
Comparative analysis reveals that the proposed approach outperforms both baseline autoencoder models in speech enhancement.
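
The variable-length handling mentioned above can be illustrated with a minimal sketch of statistics pooling as it is commonly used in speaker recognition: the mean and standard deviation are taken over the time axis, so any number of frames collapses to a vector of fixed size. The function name `statistics_pooling`, the feature dimension of 64, and the random stand-in features are assumptions for illustration; how the authors' network consumes the pooled vector is not described on this page.

```python
import numpy as np

def statistics_pooling(frames: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Collapse (num_frames, feat_dim) features into a (2 * feat_dim,) vector.

    Concatenates the per-dimension mean and standard deviation over time.
    `eps` keeps the square root stable when a dimension is constant.
    """
    mean = frames.mean(axis=0)
    std = np.sqrt(frames.var(axis=0) + eps)
    return np.concatenate([mean, std])

# Hypothetical frame-level features for two utterances of different length.
rng = np.random.default_rng(0)
short_utt = rng.standard_normal((120, 64))   # 120 frames
long_utt = rng.standard_normal((950, 64))    # 950 frames

pooled_short = statistics_pooling(short_utt)
pooled_long = statistics_pooling(long_utt)

# Both utterances map to the same output size, so no truncation is needed.
print(pooled_short.shape, pooled_long.shape)
```

Because the output size depends only on the feature dimension, layers after the pooling step can use fixed-size weights regardless of utterance duration.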

Abstract

Objectives: This work aims to enhance foreground speech by removing unwanted background noise and recovering the desired signal, using deep learning approaches with limited training data.

Methods: This study addresses the problem with a transfer learning technique that uses mel-spectrograms. Specifically, it builds on a pre-trained residual network (based on the wav2vec2 model) that includes a statistics pooling layer of the kind used in speaker recognition. The model is then trained on a limited amount of clean and noisy data. In addition, a log mel-spectrogram feature extraction technique is adopted to improve the generalization of speech enhancement models. The data come from the Noisy Speech Database curated by Cassia Valentini-Botinhao (2017) and from the LibriSpeech corpus.

Findings: Using the same dataset, two baselines, an autoencoder and a multilayer autoencoder, were compared with the proposed model. The proposed approach, with an STOI score of 0.88 and an SNR improvement of 3.27 dB, outperforms both baseline models in subjective and objective evaluation.

Novelty: This work eliminates signal truncation, a constraint observed in conventional speech enhancement pipelines, by integrating a statistics pooling layer with a pre-trained wav2vec2-based residual network to handle variable-length input. Furthermore, the use of log mel-spectrograms improves the model's robustness and flexibility, allowing it to produce state-of-the-art results even with sparse supervised training data.

Keywords: Denoise, Mel-spectrogram, Signal Processing, Transfer Learning, Wav2Vec2
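
To make the SNR improvement figure concrete, the sketch below shows one common way such a metric is computed: the SNR of the enhanced signal minus the SNR of the noisy input, both measured against the clean reference. The exact definition used by the authors is not given on this page, so the formula, the function names `snr_db` and `snr_improvement_db`, and the synthetic signals are assumptions for illustration.

```python
import numpy as np

def snr_db(clean: np.ndarray, estimate: np.ndarray, eps: float = 1e-12) -> float:
    """SNR in dB, treating (estimate - clean) as the residual noise."""
    residual = estimate - clean
    signal_power = np.sum(clean ** 2) + eps
    noise_power = np.sum(residual ** 2) + eps
    return float(10.0 * np.log10(signal_power / noise_power))

def snr_improvement_db(clean: np.ndarray, noisy: np.ndarray, enhanced: np.ndarray) -> float:
    """Output SNR minus input SNR, both relative to the clean reference."""
    return snr_db(clean, enhanced) - snr_db(clean, noisy)

# Synthetic one-second example at 16 kHz (not data from the study).
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 220 * t)
rng = np.random.default_rng(0)
noise = 0.5 * rng.standard_normal(t.shape)
noisy = clean + noise

# Hypothetical enhancer that halves the noise amplitude.
enhanced = clean + 0.5 * noise

gain = snr_improvement_db(clean, noisy, enhanced)
print(f"SNR improvement: {gain:.2f} dB")  # halving noise amplitude gives about 6.02 dB
```

Under this definition, a positive value means the enhanced signal is closer to the clean reference than the noisy input was.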


Authors

Debabrata Gogoi

Sushanta Kabir Dutta

Journals

Indian Journal of Science and Technology
