Voice spoofing attacks have become a significant challenge in today’s security domain. Although progress has been made in synthetic speech detection technology, existing detection methods still struggle to identify unknown attack strategies. To address this challenge, we propose a novel multi-level acoustic feature fusion framework, MAFF-Net, which comprises three main components: multi-level acoustic feature extraction, cross-attention feature fusion, and a graph-aggregated detection module. The multi-level acoustic feature extraction module involves two complementary processes: multi-spectrogram feature extraction, which captures low-level physical characteristics of the audio signal, and Wav2vec2 feature extraction, which focuses on high-level speech representations. These multi-level features are subsequently integrated through cross-attention, enhancing the discriminative power of the model. To better evaluate the generalization capability of the proposed model, we introduce the Chinese Advanced Synthetic Speech Dataset (CASSD), a new dataset that incorporates speech generated using 11 state-of-the-art synthesis techniques. Extensive experiments on four datasets demonstrate that our approach consistently outperforms existing single-model methods in synthetic speech detection.
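To make the cross-attention fusion step concrete, the following is a minimal NumPy sketch of single-head scaled dot-product cross-attention between two feature streams. It is not the MAFF-Net implementation: the abstract does not specify the attention direction, the number of heads, or any dimensions. The sketch assumes that spectrogram frames act as queries and Wav2vec2 frames as keys and values, and every name, shape, and the random projection matrices (`spec_feats`, `w2v_feats`, `w_q`, `w_k`, `w_v`) are hypothetical placeholders for learned parameters and real features.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    queries: (T_q, d_q) features that ask the questions
    context: (T_c, d_c) features that are attended over
    Returns the fused features (T_q, d_model) and the attention map (T_q, T_c).
    """
    q = queries @ w_q  # (T_q, d_model)
    k = context @ w_k  # (T_c, d_model)
    v = context @ w_v  # (T_c, d_model)
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (T_q, T_c)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights


# Illustrative shapes only (assumptions, not values from the paper).
rng = np.random.default_rng(0)
T_spec, T_w2v = 50, 40                 # frame counts of the two streams
d_spec, d_w2v, d_model = 64, 768, 32   # feature sizes and shared width

spec_feats = rng.standard_normal((T_spec, d_spec))  # stand-in for spectrogram features
w2v_feats = rng.standard_normal((T_w2v, d_w2v))     # stand-in for Wav2vec2 features

# Random projections stand in for learned weight matrices.
w_q = rng.standard_normal((d_spec, d_model)) / np.sqrt(d_spec)
w_k = rng.standard_normal((d_w2v, d_model)) / np.sqrt(d_w2v)
w_v = rng.standard_normal((d_w2v, d_model)) / np.sqrt(d_w2v)

fused, attn = cross_attention(spec_feats, w2v_feats, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

In this sketch each spectrogram frame receives a weighted summary of the Wav2vec2 frames, so the fused sequence keeps the time resolution of the query stream. Swapping the roles of the two streams, or attending in both directions and concatenating the results, are equally plausible designs given only the abstract.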
Chen et al. (Tue,) studied this question.