What question did this study set out to answer?

This research aims to develop a robust algorithm for accurately detecting deepfake videos by analyzing both visual and audio data.

April 28, 2026 | Open Access

Multimodal Deepfake and Synthetic Media Detector

Key Points

Designed a multimodal algorithm integrating visual and audio streams for analysis.
Employed an Xception network for visual artifact detection and a four-block CNN for audio anomaly detection on Mel spectrograms and MFCCs.
Addressed class imbalance with focal loss and balanced sampling, training on the FakeAVCeleb dataset supplemented with real samples from VoxCeleb2.
Achieved effective deepfake detection with improved accuracy over existing visual-only algorithms.
Integrating the visual and audio modalities improved performance and reduced false positives.
Outperformed previous models in recognizing manipulations across different types of deepfake scenarios.
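
As an illustration of the focal loss named in the key points, the following minimal Python sketch computes a class-weighted binary focal loss for a single prediction. The function name `focal_loss` and the values `gamma=2.0` and `alpha_real=0.75` are assumptions chosen for illustration; this summary does not state the hyperparameters the authors used.

```python
import math


def focal_loss(p_fake, label, gamma=2.0, alpha_real=0.75):
    """Binary focal loss for one sample.

    p_fake: predicted probability that the clip is fake (0..1).
    label:  1 for fake, 0 for real.
    gamma and alpha_real are illustrative values, not taken from the paper.
    """
    eps = 1e-7
    p_fake = min(max(p_fake, eps), 1.0 - eps)  # avoid log(0)
    # p_t is the probability the model assigns to the true class.
    p_t = p_fake if label == 1 else 1.0 - p_fake
    # Give the minority (real) class the larger weight.
    alpha_t = alpha_real if label == 0 else 1.0 - alpha_real
    # The (1 - p_t) ** gamma factor shrinks the loss of easy samples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With these settings, a confidently correct prediction contributes far less to the loss than an uncertain one, and a misclassified real clip is penalized more than a misclassified fake clip at the same confidence.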

Abstract

Deepfake videos are an increasingly serious problem: they use AI to replace a subject's face or voice, which spreads misinformation and erodes trust in the media and in society as a whole. Current detection algorithms rely largely on visual information alone and are therefore vulnerable to manipulations that exploit discrepancies between the visual and audio streams.

In this study, we design a multimodal algorithm that examines a video's visual stream and its accompanying audio simultaneously. A pretrained Xception network handles the visual stream, detecting artifacts of facial manipulation in face crops extracted with MTCNN. A four-block convolutional neural network handles the audio stream, analyzing Mel spectrograms and MFCCs for anomalies related to vocal manipulation. Finally, a cross-modal attention gate, designed to combine the two modalities selectively, fuses the information from both streams into a single decision. The model is trained in two stages. In the first stage, the pretrained visual branch is kept frozen while the freshly initialized audio branch and fusion head are trained jointly. In the second stage, all parameters are fine-tuned end to end at a lower learning rate, adapting the model to the task while preserving the pretrained features. To address the class imbalance in FakeAVCeleb, focal loss is combined with balanced mini-batch sampling and up-weighting of the real class.

Training and evaluation are conducted on FakeAVCeleb, which covers four types of manipulation, including face swapping, voice cloning, and synthetic generation of the entire audio-visual sequence.
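
The balanced mini-batch sampling mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' sampler: the helper name `balanced_batches`, the 50/50 class split per batch, and sampling with replacement are all hypothetical choices.

```python
import random


def balanced_batches(real_ids, fake_ids, batch_size, num_batches, seed=0):
    """Yield batches holding equal numbers of real and fake sample ids.

    Both classes are drawn with replacement, so the small real class can
    fill half of every batch however large the fake class is.
    Each batch is a list of (sample_id, label) pairs, label 0 = real, 1 = fake.
    """
    if batch_size % 2 != 0:
        raise ValueError("batch_size must be even for a 50/50 split")
    rng = random.Random(seed)
    half = batch_size // 2
    for _ in range(num_batches):
        batch = [(i, 0) for i in rng.choices(real_ids, k=half)]
        batch += [(i, 1) for i in rng.choices(fake_ids, k=half)]
        rng.shuffle(batch)  # mix classes within the batch
        yield batch


# Example: 10 real clips against 420 fake clips (a 42:1 ratio).
batches = list(balanced_batches(list(range(10)), list(range(100, 520)), 8, 5))
```

Every batch then contains four real and four fake samples even though the underlying pool is heavily skewed toward fakes.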
To offset the 42:1 fake-to-real video ratio in FakeAVCeleb, about 4,000 real samples from VoxCeleb2 are added as an auxiliary dataset aligned with the training domain. The test set contains only unseen identities and is formed separately from the training and validation sets, which avoids data leakage between splits.
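
One way to build a test set of unseen identities is to split by speaker rather than by video. The sketch below assumes each sample is a `(video_id, identity)` pair; the function name `split_by_identity` and the 20% test fraction are hypothetical and are not taken from the paper.

```python
import random


def split_by_identity(samples, test_fraction=0.2, seed=0):
    """Split (video_id, identity) pairs so that no identity spans both sets."""
    identities = sorted({identity for _, identity in samples})
    random.Random(seed).shuffle(identities)
    n_test = max(1, round(len(identities) * test_fraction))
    test_identities = set(identities[:n_test])
    # Assign every video of an identity to the same side of the split.
    train = [(v, i) for v, i in samples if i not in test_identities]
    test = [(v, i) for v, i in samples if i in test_identities]
    return train, test


# Example: 10 identities with 3 videos each.
samples = [(f"vid_{p}_{k}", f"id_{p}") for p in range(10) for k in range(3)]
train, test = split_by_identity(samples)
```

Because whole identities are assigned to one side, a model cannot score well on the test set simply by memorizing faces or voices seen during training.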
