Speech enhancement for drone audition is highly challenging because ego-noise from the rotors produces an extremely low signal-to-noise ratio (SNR), e.g., −15 dB. Despite recent advances, most deep learning-based speech enhancement methods for drone audition accept only single-channel inputs. Even among the few methods designed for multi-channel audio, the enhanced output is often reduced to a single channel, discarding spatial information that is essential for downstream processing tasks. In this paper, we propose Spatial-U-Net, a multi-channel end-to-end deep learning framework that suppresses drone noise directly in the time domain and produces multi-channel outputs. Built on Wave-U-Net, the model introduces an attention mechanism and a hybrid loss combining mean squared error (MSE) and a perceptual loss to improve both signal recognizability and perceptual quality. Evaluated on real drone-recorded audio (SNR ranging from −35 to 0 dB), the method outperforms existing deep learning-based speech enhancement methods on three metrics: SNR improvement, short-time objective intelligibility (STOI), and perceptual evaluation of speech quality (PESQ), demonstrating superior performance in extreme-noise scenarios.
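To make the SNR figures above concrete, the following is a minimal sketch of one common way to compute per-channel SNR improvement, assuming a time-aligned clean reference is available for every channel. The function names (`power`, `snr_db`, `snr_improvement_db`) and the synthetic sine-wave signals are illustrative assumptions, not the evaluation code used in the paper, whose exact protocol is not described here.

```python
import math

def power(x):
    """Mean power of a sequence of samples."""
    return sum(s * s for s in x) / len(x)

def snr_db(speech, noise):
    """SNR in dB between a speech signal and a noise signal."""
    return 10.0 * math.log10(power(speech) / power(noise))

def snr_improvement_db(clean, noisy, enhanced):
    """Per-channel SNR improvement (output SNR minus input SNR), in dB.

    Each argument is a list of channels; each channel is a list of samples.
    Assumption: the residual (signal minus clean reference) is the noise.
    """
    gains = []
    for c, y, e in zip(clean, noisy, enhanced):
        snr_in = snr_db(c, [yi - ci for yi, ci in zip(y, c)])
        snr_out = snr_db(c, [ei - ci for ei, ci in zip(e, c)])
        gains.append(snr_out - snr_in)
    return gains

# Synthetic two-channel example (hypothetical data, not drone recordings).
N = 1000
clean_ch = [math.sin(2 * math.pi * 5 * n / N) for n in range(N)]
scale = 10 ** (15 / 20)  # noise amplitude giving an input SNR of about -15 dB
noise_ch = [scale * math.sin(2 * math.pi * 50 * n / N) for n in range(N)]
noisy_ch = [c + v for c, v in zip(clean_ch, noise_ch)]
# Pretend an enhancer attenuated the noise amplitude by a factor of 10.
enhanced_ch = [c + 0.1 * v for c, v in zip(clean_ch, noise_ch)]

input_snr = snr_db(clean_ch, noise_ch)
gains = snr_improvement_db(
    [clean_ch, clean_ch], [noisy_ch, noisy_ch], [enhanced_ch, enhanced_ch]
)
```

Because the function returns one value per channel, it fits a multi-channel output such as the one Spatial-U-Net produces; a single-channel method would yield a one-element list. In this toy case, attenuating the noise amplitude tenfold corresponds to an improvement of about 20 dB on each channel.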
Wei et al. studied this question.