What question did this study set out to answer?

The study aims to improve deepfake audio detection using transfer learning and data augmentation techniques.

June 17, 2026 | Open Access

Fine-Tune Transfer Learning Model for Deepfake Audio Detection Using Hybrid Features and Data Augmentation

Key Points

The study aims to improve deepfake audio detection using transfer learning and data augmentation techniques.
Developed a framework using fine-tuned pre-trained models: ResNet50, VGGish, and YAMNet.
Implemented systematic data augmentation to enhance training sample diversity.
Fused the extracted features and applied feature selection before classifying with a lightweight model.
Achieved an accuracy of 99.7% in detecting deepfake audio.
Outperformed individual baseline models and generalized well across different acoustic conditions.
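
The summary does not say which augmentation methods were used, so the following is only a minimal sketch of three common waveform augmentations (additive noise at a target signal-to-noise ratio, circular time shift, and gain change). The function names, the 16 kHz sample rate, and the parameter values are assumptions chosen for illustration, not the study's configuration.

```python
import math
import random


def add_noise_at_snr(signal, snr_db, rng):
    """Add white Gaussian noise so the result has roughly the requested SNR in dB."""
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def time_shift(signal, shift):
    """Circularly shift the waveform to the right by `shift` samples."""
    shift %= len(signal)
    if shift == 0:
        return list(signal)
    return signal[-shift:] + signal[:-shift]


def change_gain(signal, gain_db):
    """Scale the waveform amplitude by a gain expressed in dB."""
    factor = 10 ** (gain_db / 20)
    return [x * factor for x in signal]


# Toy input: one second of a 440 Hz tone (assumed 16 kHz sample rate).
sample_rate = 16000
clean = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate)
         for n in range(sample_rate)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
noisy = add_noise_at_snr(clean, snr_db=10, rng=rng)
shifted = time_shift(clean, shift=1600)   # 100 ms circular shift
quieter = change_gain(clean, gain_db=-6)  # roughly half the amplitude

# Each augmented copy keeps the original label (genuine or fake).
augmented_batch = [noisy, shifted, quieter]
```

In a training pipeline, transforms like these would typically be applied with randomly drawn parameters before the waveform is converted to the spectrogram or embedding input each pre-trained model expects.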

Abstract

Deepfake audio created with sophisticated speech synthesis and voice cloning technologies threatens the credibility of digital communication. Its realism has raised serious concerns in applications such as digital forensics, cybersecurity, media authentication, and voice-based security systems. Nevertheless, deepfake audio detection remains difficult. Synthetic speech can mimic natural vocal patterns very closely, leaving only subtle artifacts to detect. Variations in speakers, recording conditions, and background noise make the task more complex. In addition, dataset imbalance and low diversity in training samples can reduce model robustness. To overcome these limitations, this study proposes a transfer learning framework that combines fine-tuned pre-trained models with systematic data augmentation. Augmentation methods are introduced to increase variability and mimic real acoustic conditions. This approach supports the learning of more stable and generalizable representations of both genuine and manipulated speech. The framework employs three deep learning (DL) models: ResNet50 to capture global spectro-temporal structures, VGGish to extract mid-level semantic audio embeddings, and YAMNet to identify fine-grained temporal irregularities associated with synthetic speech artifacts. Features from these models are fused through concatenation to construct a unified hybrid feature space. A feature selection stage then reduces redundancy before classification with a lightweight model. Experimental results show that the proposed hybrid approach achieves an accuracy of 99.7%. This performance significantly exceeds that of the individual baseline models, and the approach generalizes well across diverse acoustic conditions.
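
The fusion, selection, and classification stages described in the abstract can be illustrated with a minimal sketch. The abstract does not name the feature selection method or the lightweight classifier, so variance-based top-k selection and a nearest-centroid classifier are stand-ins assumed here purely for illustration. The embeddings are random placeholders with toy dimensions, not real ResNet50, VGGish, or YAMNet outputs, and all function names are hypothetical.

```python
import random
import statistics


def fuse(*branch_embeddings):
    """Concatenate per-branch embeddings into one hybrid feature vector."""
    fused = []
    for embedding in branch_embeddings:
        fused.extend(embedding)
    return fused


def top_k_variance_indices(rows, k):
    """Return the sorted indices of the k highest-variance feature columns."""
    n_features = len(rows[0])
    variances = [statistics.pvariance([row[j] for row in rows])
                 for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:k])


def fit_centroids(rows, labels):
    """Compute the mean feature vector of each class (assumed classifier)."""
    centroids = {}
    for label in set(labels):
        members = [r for r, y in zip(rows, labels) if y == label]
        centroids[label] = [statistics.fmean(col) for col in zip(*members)]
    return centroids


def predict(centroids, row):
    """Assign the label of the nearest centroid (squared Euclidean distance)."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


rng = random.Random(0)  # fixed seed so the sketch is reproducible


def toy_embedding(dim, label):
    """Placeholder for a pre-trained model's embedding of one audio clip."""
    return [rng.gauss(2.0 * label, 1.0) for _ in range(dim)]


# 0 = genuine, 1 = deepfake; three branches with toy dimensions 8, 4, and 6.
labels = [i % 2 for i in range(40)]
fused_rows = [fuse(toy_embedding(8, y), toy_embedding(4, y), toy_embedding(6, y))
              for y in labels]

keep = top_k_variance_indices(fused_rows, k=10)
selected_rows = [[row[j] for j in keep] for row in fused_rows]

centroids = fit_centroids(selected_rows, labels)
predictions = [predict(centroids, row) for row in selected_rows]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
```

The point of the sketch is the data flow: each clip yields one vector per branch, concatenation produces the hybrid feature space, and selection shrinks that space before a small classifier is fit. The accuracy computed on this synthetic data says nothing about the 99.7% reported in the study.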
