The rapid advancement of AI-generated synthetic speech poses significant threats, including identity fraud and misinformation, as deepfake audio becomes increasingly indistinguishable from genuine recordings. While existing detection methods have achieved high accuracy on specific datasets, they often struggle with generalization across diverse audio samples and real-world conditions. To address this limitation, this paper proposes a hybrid Deep CNN-LSTM model that leverages both Mel Frequency Cepstral Coefficients (MFCCs) and spectrogram analysis to capture complementary spatial and temporal artifacts indicative of synthetic speech. The model was evaluated on the Fake-or-Real (FoR) dataset, achieving a classification accuracy of 94.7%, surpassing standalone CNN (87.3%) and LSTM (82.7%) models. Crucially, the model demonstrated strong generalization capabilities with an AUC-ROC score of 97.3%. Further cross-dataset evaluation on ASVspoof 2019 confirmed its robustness, achieving an accuracy of 93.2%. The results indicate that the fusion of spectral and temporal features through a hybrid architecture provides a more robust solution for detecting AI-generated audio, contributing to the development of reliable deepfake detection systems for cybersecurity and digital forensics applications.
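The abstract reports both a classification accuracy and an AUC-ROC score. As a minimal sketch of how these two metrics differ, the pure-Python example below computes each from per-clip detector scores. The label convention (1 = synthetic, 0 = genuine), the 0.5 decision threshold, the function names, and the toy scores are all assumptions made for illustration; none of them come from the paper, and this is not the authors' evaluation code.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of clips classified correctly at a fixed threshold.

    Assumed convention: label 1 = synthetic, 0 = genuine, and a score
    at or above `threshold` is predicted as synthetic.
    """
    preds = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via its rank interpretation.

    AUC equals the probability that a randomly chosen synthetic clip
    receives a higher score than a randomly chosen genuine clip, with
    ties counted as one half. It does not depend on any threshold.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one clip of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only (not results from the paper).
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

print(f"accuracy @ 0.5: {accuracy(toy_labels, toy_scores):.3f}")
print(f"AUC-ROC:        {auc_roc(toy_labels, toy_scores):.3f}")
```

Accuracy depends on the chosen threshold, whereas AUC-ROC summarizes how well the scores rank synthetic clips above genuine ones across all thresholds, which is why the two figures can differ for the same model.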
Clive Asuai
Ayigbe Arinomor
Collins Tobore Atumah
American Journal of Mathematical and Computer Modelling
Delta State Polytechnic Ogwashi-Uku
www.synapsesocial.com/papers/68da58dcc1728099cfd1151a — DOI: https://doi.org/10.11648/j.ajmcm.20251003.12