What type of study is this?

This is a quantitative study.

September 29, 2025 | Open Access

Hybrid CNN-LSTM Architectures for Deepfake Audio Detection Using Mel Frequency Cepstral Coefficients and Spectogram Analysis

Key Points

The hybrid model achieved a classification accuracy of 94.7% on the Fake-or-Real (FoR) dataset, surpassing standalone CNN (87.3%) and LSTM (82.7%) models.
It reached an AUC-ROC score of 97.3%, which the authors cite as evidence of strong generalization.
Cross-dataset evaluation on ASVspoof 2019 supported its robustness, with an accuracy of 93.2%.
The fusion of spectral and temporal features significantly enhances deepfake audio detection for cybersecurity applications.

Abstract

The rapid advancement of AI-generated synthetic speech poses significant threats, including identity fraud and misinformation, as deepfake audio becomes increasingly indistinguishable from genuine recordings. While existing detection methods have achieved high accuracy on specific datasets, they often struggle with generalization across diverse audio samples and real-world conditions. To address this limitation, this paper proposes a hybrid Deep CNN-LSTM model that leverages both Mel Frequency Cepstral Coefficients (MFCCs) and spectrogram analysis to capture complementary spatial and temporal artifacts indicative of synthetic speech. The model was evaluated on the Fake-or-Real (FoR) dataset, achieving a classification accuracy of 94.7%, surpassing standalone CNN (87.3%) and LSTM (82.7%) models. Crucially, the model demonstrated strong generalization capabilities with an AUC-ROC score of 97.3%. Further cross-dataset evaluation on ASVspoof 2019 confirmed its robustness, achieving an accuracy of 93.2%. The results indicate that the fusion of spectral and temporal features through a hybrid architecture provides a more robust solution for detecting AI-generated audio, contributing to the development of reliable deepfake detection systems for cybersecurity and digital forensics applications.
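The abstract reports two different metrics, accuracy and AUC-ROC, and the difference between them is easy to blur. The sketch below is an illustration only, not the authors' evaluation code. It assumes a binary labelling (1 = fake, 0 = real) and a detector that outputs one score per clip. The function names, the 0.5 threshold, and the toy scores are all hypothetical. Accuracy depends on the chosen threshold, while AUC-ROC measures how well the scores rank fake clips above real ones across all thresholds.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], scores: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of clips whose thresholded score matches the label (1 = fake, 0 = real)."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC-ROC as the probability that a random fake clip outscores a random real clip.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC-ROC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy, made-up scores (assumption for illustration; not data from the paper).
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]

toy_accuracy = accuracy(toy_labels, toy_scores)
toy_auc = auc_roc(toy_labels, toy_scores)
```

On these toy scores the two metrics disagree in size because one fake clip falls below the threshold and one real clip rises above it, yet most fake clips still rank above most real ones. The pairwise loop is quadratic in the number of clips, so a sort-based or library implementation would be preferable for a full dataset.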

Cite This Study

Asuai et al. studied this question.

synapsesocial.com/papers/68da58dcc1728099cfd1151a https://doi.org/10.11648/j.ajmcm.20251003.12
