Non-invasive speech perception decoding aims to identify speech segments from magneto/electro-encephalography (M/EEG) signals recorded while subjects listen to speech. Although speech perception is widely recognized as a hierarchical process that extends from the auditory periphery to the auditory cortex and on to the whole brain, existing decoding methods rarely take this hierarchy into account. In this study, we propose a novel hierarchical decoding framework that leverages three distinct speech representations: mel-spectrogram, Wav2vec 2.0, and GPT-2. Together, these representations span a wide range of speech features, from low-level acoustic properties to high-level linguistic information. Our proposed decoder, ConvConcatNet, uses iterative convolution and concatenation to extract and integrate patterns from the hierarchical neural responses. These neural features are then aligned with the speech representations through contrastive learning. Decoding performance was evaluated on a Chinese MEG dataset (SMN4Lang) and a Dutch EEG dataset (SparrKULee). Our results show that, while the Wav2vec 2.0 representation achieved the highest decoding performance of the three, integrating all three representations led to a substantial improvement, underscoring the value of combining them. Notably, the GPT-2 representation improved decoding accuracy particularly for words with greater contextual dependencies. Moreover, our ConvConcatNet decoder outperformed existing methods, demonstrating stronger neural feature extraction and integration. Our method achieves a Top-1 accuracy of 35.6% on the MEG dataset and 20.0% on the EEG dataset, significantly outperforming the previous state-of-the-art method under the same experimental settings. The source code is available at https://github.com/bobwangPKU/HDPS.
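To make the contrastive alignment and the Top-1 metric concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes each speech segment has already been reduced to one neural feature vector and one speech representation vector of the same dimension. The function names, the temperature value, and the one-directional (neural-to-speech) InfoNCE-style loss are all illustrative assumptions; the actual loss and architecture are defined in the linked repository.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def similarity_matrix(neural, speech):
    # Entry (i, j) is the cosine similarity between neural segment i
    # and speech segment j.
    return l2_normalize(neural) @ l2_normalize(speech).T

def contrastive_loss(neural, speech, temperature=0.1):
    # InfoNCE-style loss (assumed form): for each neural segment, the matching
    # speech segment on the diagonal is the positive and all others are negatives.
    logits = similarity_matrix(neural, speech) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def top1_accuracy(neural, speech):
    # A segment counts as correctly decoded when its own speech representation
    # is the most similar candidate among all segments in the pool.
    sims = similarity_matrix(neural, speech)
    return float(np.mean(sims.argmax(axis=1) == np.arange(sims.shape[0])))

# Toy data: 8 segments with 16-dimensional features (hypothetical sizes).
rng = np.random.default_rng(0)
speech = rng.normal(size=(8, 16))
neural = speech + 0.01 * rng.normal(size=(8, 16))  # nearly aligned features

aligned_loss = contrastive_loss(neural, speech)
shuffled_loss = contrastive_loss(neural, np.roll(speech, 1, axis=0))
accuracy = top1_accuracy(neural, speech)
```

In this toy setup, well-aligned features give a lower loss than deliberately mismatched pairs, and Top-1 accuracy is the fraction of segments whose true speech representation ranks first among the candidates. Under this reading, the reported 35.6% and 20.0% figures depend on the size of the candidate pool used in the paper's evaluation, which the abstract does not state.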
Wang et al. (Tue,) studied this question.