What question did this study set out to answer?

This study aims to develop a hierarchical decoding framework for perceived speech using non-invasive brain recordings.

May 1, 2026

Hierarchical Decoding of Perceived Speech from Non-Invasive Brain Recordings.

Key Points

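Brain activity was recorded with magneto/electro-encephalography (M/EEG) while participants listened to speech.
The framework leverages three speech representations spanning low-level acoustic to high-level linguistic features: mel-spectrogram, Wav2vec 2.0, and GPT-2.
The proposed decoder, ConvConcatNet, uses iterative convolution and concatenation to extract neural features, which are aligned with the speech representations through contrastive learning.
Decoding performance was evaluated on two datasets: SMN4Lang (Chinese MEG) and SparrKULee (Dutch EEG).
Wav2vec 2.0 achieved the highest decoding performance of the three representations, but integrating all three led to a substantial further improvement.
The method achieved a Top-1 accuracy of 35.6% on the MEG dataset and 20.0% on the EEG dataset.
ConvConcatNet outperformed the previous state-of-the-art method under the same experimental settings, and the GPT-2 representation improved accuracy particularly for words with greater contextual dependencies.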

Abstract

Non-invasive speech perception decoding aims to identify speech segments using magneto/electro-encephalography (M/EEG) signals recorded while subjects listen to speech. Although speech perception is widely recognized as a hierarchical process from the auditory periphery to the auditory cortex, and to the whole brain, this hierarchy is rarely considered in existing decoding methods. In this study, we propose a novel hierarchical decoding framework that leverages three distinct speech representations: mel-spectrogram, Wav2vec 2.0, and GPT-2. These representations encompass a wide range of speech features, spanning from low-level acoustic properties to high-level linguistic information. Our proposed decoder, ConvConcatNet, utilizes iterative convolution and concatenation to extract and integrate patterns of the hierarchical neural responses. These neural features are subsequently aligned with the speech representations through contrastive learning. The decoding performance was evaluated on a Chinese MEG dataset (SMN4Lang) and a Dutch EEG dataset (SparrKULee). Our results show that while the Wav2vec 2.0 representation achieved the highest decoding performance among the three, the integration of all three representations led to a substantial improvement, highlighting the critical role of combining them for enhanced performance. Notably, the GPT-2 representation enhanced decoding accuracy particularly for words with greater contextual dependencies. Moreover, our ConvConcatNet decoder outperformed existing methods, showcasing superior capabilities in neural feature extraction and integration. Our method achieves a Top-1 accuracy of 35.6% on the MEG dataset and 20.0% on the EEG dataset, significantly outperforming the previous state-of-the-art method under the same experimental settings. The source code is available at https://github.com/bobwangPKU/HDPS.
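
The contrastive alignment and Top-1 evaluation described in the abstract can be illustrated with a minimal sketch. It is not taken from the authors' repository: the CLIP-style InfoNCE form of the loss, the use of cosine similarity, the temperature value, the toy vectors, and all function names are assumptions made for illustration only.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(neural, speech):
    """Entry [i][j] compares neural segment i with speech segment j."""
    return [[cosine_similarity(n, s) for s in speech] for n in neural]

def contrastive_loss(neural, speech, temperature=0.1):
    """InfoNCE-style loss (assumed form): neural segment i should match speech segment i."""
    sims = similarity_matrix(neural, speech)
    total = 0.0
    for i, row in enumerate(sims):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]  # cross-entropy against the true pairing
    return total / len(sims)

def top1_accuracy(neural, speech):
    """Fraction of neural segments whose most similar speech segment is the true one."""
    sims = similarity_matrix(neural, speech)
    hits = 0
    for i, row in enumerate(sims):
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == i)
    return hits / len(sims)

# Toy data (hypothetical): three speech segments and three decoded neural features.
speech = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neural = [[0.9, 0.1], [0.2, 0.8], [1.0, -1.0]]

accuracy = top1_accuracy(neural, speech)
loss = contrastive_loss(neural, speech)
```

In this toy setting the first two neural features are closest to their own speech segments and the third is not, so two of three segments are identified correctly. In the study's setting the candidate pool would be the set of speech segments under evaluation, and the neural features would come from the decoder rather than being hand-written.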
