Deepfake videos are an increasingly serious problem: they use AI to manipulate visual data by replacing a subject's face or voice, which spreads misinformation and erodes trust in the media and in society as a whole. Most current detection algorithms rely on visual information alone and are therefore vulnerable to manipulations that exploit discrepancies between the visual and audio streams. In this study, we design a multimodal algorithm that examines a video's visual stream and its accompanying audio simultaneously. The visual branch is a pretrained Xception convolutional neural network that analyzes artifacts of facial manipulation in face crops extracted with MTCNN. The audio branch is a four-block convolutional neural network that analyzes Mel spectrograms and MFCCs for anomalies related to vocal manipulation. The information from both streams is then fused into a single decision by a cross-modal attention gate designed to combine the two modalities selectively. The model is trained in two stages. In the first stage, the pretrained visual branch is frozen while the audio and fusion heads are freshly initialized and trained jointly. In the second stage, all parameters are fine-tuned end to end at a lower learning rate so that the model adapts to the task while preserving the pretrained features. To address the class imbalance in FakeAVCeleb, focal loss is combined with balanced mini-batch sampling and up-weighting of the real class. Training and evaluation are conducted on FakeAVCeleb, which covers four types of manipulation, including face swap, voice cloning, and fully synthetic audio-visual sequences.
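As a rough illustration of the focal loss with real-class up-weighting mentioned above, the following minimal sketch computes the loss for a single sample. The function name `focal_loss` and the values `gamma=2.0` and `alpha_real=0.75` are assumptions chosen for illustration; the text does not state the hyperparameters actually used.

```python
import math


def focal_loss(p_fake, label, gamma=2.0, alpha_real=0.75):
    """Binary focal loss for one sample (illustrative sketch).

    p_fake     : predicted probability that the video is fake
    label      : 1 for fake, 0 for real
    gamma      : focusing parameter (assumed value, not from the text)
    alpha_real : weight of the minority real class (assumed value)
    """
    eps = 1e-7
    p_fake = min(max(p_fake, eps), 1.0 - eps)  # avoid log(0)
    # p_t is the probability assigned to the true class.
    p_t = p_fake if label == 1 else 1.0 - p_fake
    # The real class receives the larger weight to offset its rarity.
    alpha_t = (1.0 - alpha_real) if label == 1 else alpha_real
    # (1 - p_t)^gamma shrinks the loss of well-classified samples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With these assumed settings, a confidently correct prediction contributes far less than a borderline one, and an error on a real video costs three times as much as an equally wrong prediction on a fake video.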
To mitigate the 42:1 fake-to-real video ratio in the dataset, about 4,000 real samples from VoxCeleb2 were added as an auxiliary dataset aligned with the training domain. The test set contains only unseen identities and was formed separately from the training and validation sets to avoid data leakage across splits.
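One way to obtain a test set of unseen identities is to split on identities rather than on individual videos. The sketch below assumes each sample is an `(identity, video_path)` pair; the helper `split_by_identity`, the 20% test fraction, and the seed are hypothetical and not taken from the study.

```python
import random


def split_by_identity(samples, test_fraction=0.2, seed=0):
    """Split (identity, video_path) pairs so that no identity
    appears on both sides of the split (illustrative sketch)."""
    # Collect the distinct identities in a deterministic order.
    identities = sorted({identity for identity, _ in samples})
    rng = random.Random(seed)
    rng.shuffle(identities)
    # Hold out a fraction of identities, at least one.
    n_test = max(1, round(len(identities) * test_fraction))
    test_ids = set(identities[:n_test])
    train = [s for s in samples if s[0] not in test_ids]
    test = [s for s in samples if s[0] in test_ids]
    return train, test
```

Because the held-out unit is the identity, every clip of a held-out speaker lands in the test set, which is what prevents identity leakage between splits.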
P et al. (Sun,) studied this question.