What type of study is this?

This is a quantitative study.

September 29, 2025 | Open Access

Hybrid CNN-LSTM Architectures for Deepfake Audio Detection Using Mel Frequency Cepstral Coefficients and Spectrogram Analysis

Clive Asuai (Delta State Polytechnic Ogwashi-Uku), Ayigbe Prince Arinomor (Delta State Polytechnic Ogwashi-Uku), Collins Tobore Atumah (Delta State Polytechnic Ogwashi-Uku)

Key Points

The hybrid model achieved a classification accuracy of 94.7%, surpassing standalone CNN and LSTM models.
It reached an AUC-ROC score of 97.3%, meaning it ranks synthetic clips above genuine ones reliably across decision thresholds.
Further evaluations on the ASVspoof 2019 dataset confirmed robustness, achieving an accuracy of 93.2%.
The fusion of spectral and temporal features significantly enhances deepfake audio detection for cybersecurity applications.
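
The two headline metrics above measure different things: accuracy depends on a chosen decision threshold, while AUC-ROC summarizes how well the scores rank fake clips above real ones at every threshold. The sketch below illustrates that difference in plain Python. The function names, the 0.5 threshold, the label convention (1 = fake, 0 = real), and the toy scores are all assumptions made for illustration; none of them come from the paper.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of clips classified correctly at a fixed threshold (assumed 0.5)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def auc_roc(labels, scores):
    """AUC-ROC as the probability that a random fake clip outscores a random
    real clip, with ties counted as half a win (pairwise formulation)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one clip of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration: 1 = fake, 0 = real.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

acc = accuracy(labels, scores)
auc = auc_roc(labels, scores)
print(f"accuracy={acc:.3f}  auc_roc={auc:.3f}")
```

On this toy data the accuracy is 4/6 and the AUC-ROC is 8/9: two clips fall on the wrong side of the 0.5 threshold, but only one fake/real pair is ranked in the wrong order. The pairwise loop is quadratic in the number of clips, which is fine for a demonstration but not for a full evaluation set.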

Abstract

The rapid advancement of AI-generated synthetic speech poses significant threats, including identity fraud and misinformation, as deepfake audio becomes increasingly indistinguishable from genuine recordings. While existing detection methods have achieved high accuracy on specific datasets, they often struggle with generalization across diverse audio samples and real-world conditions. To address this limitation, this paper proposes a hybrid Deep CNN-LSTM model that leverages both Mel Frequency Cepstral Coefficients (MFCCs) and spectrogram analysis to capture complementary spatial and temporal artifacts indicative of synthetic speech. The model was evaluated on the Fake-or-Real (FoR) dataset, achieving a classification accuracy of 94.7%, surpassing standalone CNN (87.3%) and LSTM (82.7%) models. Crucially, the model demonstrated strong generalization capabilities with an AUC-ROC score of 97.3%. Further cross-dataset evaluation on ASVspoof 2019 confirmed its robustness, achieving an accuracy of 93.2%. The results indicate that the fusion of spectral and temporal features through a hybrid architecture provides a more robust solution for detecting AI-generated audio, contributing to the development of reliable deepfake detection systems for cybersecurity and digital forensics applications.
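
To make the idea of fusing spatial and temporal features concrete, here is a minimal PyTorch sketch of one way such a hybrid could be wired. The abstract does not say how the CNN and LSTM are combined, so the two-branch late-fusion layout, the class name `HybridCnnLstm`, every layer size, and the input shapes are assumptions, not the authors' architecture. Random tensors stand in for real spectrograms and MFCCs; feature extraction is not shown.

```python
import torch
import torch.nn as nn


class HybridCnnLstm(nn.Module):
    """Illustrative two-branch detector; all layer sizes are assumptions."""

    def __init__(self, n_mfcc: int = 20, hidden: int = 64):
        super().__init__()
        # CNN branch: treats the spectrogram as a one-channel image.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # -> (batch, 32, 1, 1) for any input size
        )
        # LSTM branch: reads the MFCC frames in time order.
        self.lstm = nn.LSTM(input_size=n_mfcc, hidden_size=hidden, batch_first=True)
        # Fusion head: concatenated branch embeddings -> one logit per clip.
        self.head = nn.Sequential(
            nn.Linear(32 + hidden, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, spec: torch.Tensor, mfcc: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, freq_bins, frames); mfcc: (batch, frames, n_mfcc)
        spatial = self.cnn(spec).flatten(1)   # (batch, 32)
        _, (h_n, _) = self.lstm(mfcc)         # h_n: (num_layers, batch, hidden)
        temporal = h_n[-1]                    # last layer's final state
        fused = torch.cat([spatial, temporal], dim=1)
        return self.head(fused).squeeze(1)    # (batch,) raw logits


# Random stand-ins for a batch of 4 clips (shapes are assumptions).
model = HybridCnnLstm()
spec = torch.randn(4, 1, 64, 100)   # 64 frequency bins, 100 frames
mfcc = torch.randn(4, 100, 20)      # 100 frames, 20 coefficients
logits = model(spec, mfcc)
probs = torch.sigmoid(logits)       # probability that each clip is fake
print(probs.shape)
```

The model returns raw logits so that it can be trained with `nn.BCEWithLogitsLoss`; the sigmoid is applied only when probabilities are needed. A sequential design, in which CNN feature maps are fed into the LSTM, would be an equally plausible reading of "hybrid".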
