What question did this study set out to answer?

This talk presents two deep neural network (DNN)-based approaches to improving multi-microphone noise reduction: peerRTF and ExNet-BF + PF.

May 14, 2026

Two deep learning approaches for multi-microphone beamforming

Key Points

Introduces peerRTF, a method that uses a graph convolutional network (GCN) to learn the manifold of relative transfer functions (RTFs) and infer them robustly.
Describes ExNet-BF + PF, a two-stage architecture in which a deep neural network (DNN) estimates the beamformer weights and a single-channel postfilter follows.
peerRTF enables robust RTF inference within a spatially constrained region, which improves MVDR beamformer performance.
ExNet-BF + PF preserves the interpretability and constraints of classical beamformer designs.

Abstract

Multi-microphone noise reduction remains a central topic in audio signal processing. Traditionally, spatial filters, such as the Minimum Variance Distortionless Response (MVDR) beamformer and the Multichannel Wiener Filter (MWF), are designed to meet specific optimization objectives. Incorporating acoustic propagation models has been shown to enhance their performance significantly. This talk begins with a brief overview of beamforming for speech enhancement, emphasizing the role of acoustic modeling. We then introduce two recent deep neural network (DNN)-based approaches for multi-microphone processing. The first method, peerRTF, adheres to the MVDR criterion, with the Relative Transfer Function (RTF) serving as the steering vector. By learning the structure of the RTF manifold with a graph convolutional network (GCN), the method enables robust inference of RTFs in a spatially constrained region, leading to enhanced beamformer performance. The second method, ExNet-BF + PF (Explainable DNN-based Beamformer with Postfilter), directly estimates beamformer weights using a DNN while preserving the interpretability and constraints of classical designs. It employs a two-stage architecture consisting of a multichannel spatial filter with time-invariant weights followed by a time-varying single-channel postfilter. We further analyze how spatial cues are exploited during inference, providing insights into the learned beam patterns. Audio demonstrations will accompany the presentation.
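
As a rough illustration of the two signal paths described in the abstract, and not the speakers' implementation, the NumPy sketch below computes MVDR weights for one frequency bin from a given RTF and noise covariance, then applies a time-invariant spatial filter followed by a time-varying single-channel gain. The function names, array shapes, and the modelling of the postfilter as a real-valued gain are assumptions made for this example. The sketch also assumes the RTF, noise covariance, weights, and gain are already available: in peerRTF the RTF would be inferred by the GCN, and in ExNet-BF + PF the weights and postfilter would be estimated by the DNN.

```python
import numpy as np


def mvdr_weights(noise_cov: np.ndarray, rtf: np.ndarray) -> np.ndarray:
    """MVDR weights for one frequency bin: w = R^{-1} h / (h^H R^{-1} h).

    noise_cov: (mics, mics) Hermitian positive-definite noise covariance R
    rtf:       (mics,) complex steering vector h (RTF, reference mic = 1)
    """
    # Solve R z = h instead of forming the explicit inverse of R.
    r_inv_h = np.linalg.solve(noise_cov, rtf)
    # Normalise so the distortionless constraint w^H h = 1 holds.
    return r_inv_h / (rtf.conj() @ r_inv_h)


def two_stage_filter(stft: np.ndarray, weights: np.ndarray, gain: np.ndarray) -> np.ndarray:
    """Time-invariant spatial filter followed by a time-varying single-channel gain.

    stft:    (mics, frames, bins) complex multichannel STFT
    weights: (mics, bins) complex beamformer weights, fixed over time
    gain:    (frames, bins) real postfilter gain (hypothetical placeholder)
    """
    # Stage 1: w(f)^H x(t, f), summing over microphones.
    beamformed = np.einsum("mf,mtf->tf", weights.conj(), stft)
    # Stage 2: per time-frequency gain applied to the single-channel output.
    return gain * beamformed
```

With weights that satisfy the distortionless constraint in every bin and a unit gain, a noise-free signal that reaches the microphones through the same RTFs passes through `two_stage_filter` unchanged, which is a convenient sanity check for the array shapes.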
