What question did this study set out to answer?

The study aims to improve the quality of speech recorded by drones in low signal-to-noise ratio scenarios using a multi-channel framework.

May 14, 2026

Spatial-U-Net: A multi-channel speech enhancement framework for low signal-to-noise ratio scenario

Key Points

Proposed the Spatial-U-Net framework for multi-channel speech enhancement in the time domain, with multi-channel outputs.
Introduced an attention mechanism and a hybrid loss combining mean squared error and perceptual loss.
Evaluated on drone-recorded audio with signal-to-noise ratios ranging from -35 to 0 dB.
Achieved a larger signal-to-noise ratio improvement than existing methods (exact values are not given in the abstract).
Improved speech intelligibility, as indicated by a higher short-time objective intelligibility (STOI) score.
Outperformed existing methods in perceptual evaluation of speech quality.
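
To make the SNR figures in the points above concrete, here is a minimal Python sketch of how SNR in dB and SNR improvement are conventionally computed. The function names (`power`, `snr_db`, `snr_improvement_db`) are illustrative and not taken from the paper, and the paper's exact evaluation procedure is not specified here. The sketch assumes a clean reference signal is available and treats the difference between a signal and that reference as noise.

```python
import math


def power(x):
    """Mean power of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def snr_db(speech, noise):
    """SNR in dB: 10 * log10(speech power / noise power)."""
    return 10.0 * math.log10(power(speech) / power(noise))


def snr_improvement_db(clean, noisy, enhanced):
    """Output SNR minus input SNR.

    Assumption: whatever differs from the clean reference counts as noise.
    """
    noise_in = [n - c for n, c in zip(noisy, clean)]
    noise_out = [e - c for e, c in zip(enhanced, clean)]
    return snr_db(clean, noise_out) - snr_db(clean, noise_in)


# Linear power ratios implied by SNR levels in the evaluated range.
for level in (0, -15, -35):
    ratio = 10 ** (-level / 10)
    print(f"SNR {level:>4} dB: noise power is {ratio:.1f}x the speech power")
```

At −35 dB the noise carries roughly 3162 times the power of the speech, which is why the lower end of the evaluated range counts as an extreme condition.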

Abstract

Speech enhancement for drone audition is highly challenging due to the extremely low signal-to-noise ratio (e.g., an SNR of −15 dB) caused by ego-noise from the rotors. Despite recent advances, most deep learning-based speech enhancement methods for drone audition focus only on single-channel inputs. Even among the few methods designed for multi-channel audio, the enhanced output is often reduced to a single channel, thereby discarding valuable spatial information that is essential for downstream processing tasks. In this paper, we propose Spatial-U-Net, a multi-channel end-to-end deep learning framework designed to suppress drone noise directly in the time domain with multi-channel outputs. Based on Wave-U-Net, the model introduces an attention mechanism and a hybrid loss combining mean squared error (MSE) and perceptual loss to improve both signal recognizability and perceptual quality. Evaluated on real drone-recorded audio (SNR ranging from −35 to 0 dB), the method outperforms existing deep learning-based speech enhancement methods on three metrics: SNR improvement, short-time objective intelligibility (STOI), and perceptual evaluation of speech quality (PESQ), demonstrating superior performance in extreme noise scenarios.
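
The hybrid loss mentioned in the abstract can be illustrated with a small Python sketch. This is not the authors' formulation. The abstract does not specify the perceptual term or the weighting, so the code below substitutes a crude stand-in (MSE between log frame energies) and a hypothetical weight `alpha`, purely to show how a waveform-domain term and a feature-domain term might be combined across channels.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def log_frame_energies(x, frame=4, eps=1e-8):
    """Log energy of consecutive non-overlapping frames.

    A stand-in feature only; the paper's perceptual loss is not specified here.
    """
    return [
        math.log(sum(v * v for v in x[i:i + frame]) + eps)
        for i in range(0, len(x) - frame + 1, frame)
    ]


def hybrid_loss(estimate, target, alpha=0.5, frame=4):
    """alpha * waveform MSE + (1 - alpha) * feature MSE, averaged over channels.

    estimate, target: lists of channels, each a list of samples.
    alpha and frame are hypothetical parameters chosen for illustration.
    """
    total = 0.0
    for est_ch, tgt_ch in zip(estimate, target):
        wave_term = mse(est_ch, tgt_ch)
        feat_term = mse(
            log_frame_energies(est_ch, frame),
            log_frame_energies(tgt_ch, frame),
        )
        total += alpha * wave_term + (1.0 - alpha) * feat_term
    return total / len(estimate)
```

Because the loss is averaged over channels and never mixes them down, a sketch of this shape is compatible with the multi-channel output the abstract emphasizes. In practice a perceptual loss is often computed from features of a pretrained network, and whether that is the case here is not stated in the abstract.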
