Deepfake audio created with sophisticated speech synthesis and voice cloning technologies threatens the credibility of digital communication. Its realism has raised serious concerns in applications such as digital forensics, cybersecurity, media authentication, and voice-based security systems. However, deepfake audio detection remains difficult. Synthetic speech closely mimics natural vocal patterns and leaves only subtle artifacts. Variations in speakers, recording conditions, and background noise make the task more complex. In addition, dataset imbalance and low diversity in training samples can reduce model robustness. To overcome these limitations, this study proposes a transfer learning framework that combines fine-tuned pre-trained models with systematic data augmentation. Augmentation increases variability and mimics real acoustic conditions, which supports the learning of more stable and generalizable representations of both genuine and manipulated speech. The framework employs three deep learning models: ResNet50 to capture global spectro-temporal structures, VGGish to extract mid-level semantic audio embeddings, and YAMNet to identify fine-grained temporal irregularities associated with synthetic speech artifacts. Features from these models are fused through concatenation to construct a unified hybrid feature space. A feature selection stage then reduces redundancy before classification with a lightweight model. In experiments, the proposed hybrid approach achieved an accuracy of 99.7%, significantly outperforming the individual baseline models and generalizing well across diverse acoustic conditions.
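The fusion, selection, and classification stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings are random stand-ins rather than real ResNet50, VGGish, or YAMNet outputs, the embedding sizes (2048, 128, 1024) are assumed pooled-output dimensions, and the Fisher-style feature score and nearest-centroid classifier are placeholders, since the abstract does not name the selection method or the lightweight model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed embedding sizes for the three backbones (illustrative only).
DIMS = {"resnet50": 2048, "vggish": 128, "yamnet": 1024}


def fake_embeddings(labels, dim, rng):
    """Toy stand-in for a backbone: noise plus a weak class signal in 8 dims."""
    x = rng.normal(size=(labels.shape[0], dim))
    x[:, :8] += 1.5 * labels[:, None]
    return x


def fuse(blocks):
    """Feature-level fusion: concatenate per-model embeddings per sample."""
    return np.concatenate(blocks, axis=1)


def select_top_k(x, y, k):
    """Keep the k features with the largest Fisher-style class separation."""
    mu0, mu1 = x[y == 0].mean(axis=0), x[y == 1].mean(axis=0)
    v0, v1 = x[y == 0].var(axis=0), x[y == 1].var(axis=0)
    score = (mu0 - mu1) ** 2 / (v0 + v1 + 1e-12)
    return np.argsort(score)[::-1][:k]


def fit_centroids(x, y):
    """Lightweight classifier: one mean vector per class (0 genuine, 1 fake)."""
    return np.stack([x[y == 0].mean(axis=0), x[y == 1].mean(axis=0)])


def predict(centroids, x):
    """Assign each sample to the nearest class centroid."""
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)


n = 400
y = rng.integers(0, 2, size=n)
fused = fuse([fake_embeddings(y, d, rng) for d in DIMS.values()])

train, test = np.arange(300), np.arange(300, n)
# Select features on the training split only, to avoid leaking test labels.
idx = select_top_k(fused[train], y[train], k=64)
centroids = fit_centroids(fused[train][:, idx], y[train])
acc = float((predict(centroids, fused[test][:, idx]) == y[test]).mean())

print("fused shape:", fused.shape)  # (400, 3200)
print("selected features:", idx.shape[0])
print("toy test accuracy:", round(acc, 3))
```

The accuracy printed here reflects only the synthetic toy data and says nothing about the reported 99.7%. In a real pipeline the three `fake_embeddings` calls would be replaced by embeddings extracted from the fine-tuned models on augmented audio.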
Jahangir et al. studied this question.