What question did this study set out to answer?

The research aims to develop a robust framework for detecting deepfakes by integrating textual and frequency data.

April 12, 2026 · Open Access

Cross-modal deepfake detection: integrating textual and frequency domains

Key Points

Developed CMITF (Cross-Modal Integration of Textual-Frequency), a framework that pairs large pre-trained models with lightweight task-specific modules.
Used the CLIP model to extract textual embeddings from ClipCap-generated captions.
Implemented a compact DCT-based frequency encoder to capture high-frequency representations.
Aligned the textual and frequency modalities through a Cross-modal Complementary Feature Alignment (CCFA) module.
Applied a lightweight Feature Aggregation and Classification (FAC) module for efficient deployment.
CMITF consistently outperformed state-of-the-art detection methods on widely used benchmarks.
Demonstrated superior performance in cross-dataset and cross-compression scenarios.
Established the effectiveness of combining large semantic models with compact frequency detectors.
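
As a rough illustration of the DCT-based high-frequency idea in the points above, the sketch below computes a 2D DCT of a small grayscale block and keeps only the coefficients away from the low-frequency corner. It is a minimal sketch and not the paper's encoder: the function names, the block size, and the `cutoff` rule (drop coefficients with `u + v < cutoff`) are all assumptions made for this example.

```python
import math

def dct_1d(x):
    """Orthonormal DCT-II of a 1D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def dct_2d(block):
    """Separable 2D DCT-II: transform rows, then columns."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]

def high_frequency_part(coeffs, cutoff):
    """Zero every coefficient with u + v < cutoff (hypothetical high-pass rule)."""
    return [
        [c if (u + v) >= cutoff else 0.0 for v, c in enumerate(row)]
        for u, row in enumerate(coeffs)
    ]

def high_freq_energy(block, cutoff=2):
    """Sum of squared high-frequency DCT coefficients of a block."""
    hf = high_frequency_part(dct_2d(block), cutoff)
    return sum(c * c for row in hf for c in row)

# A flat block has no high-frequency content; a checkerboard has a lot.
flat = [[1.0] * 4 for _ in range(4)]
checker = [[float((i + j) % 2) for j in range(4)] for i in range(4)]
print(high_freq_energy(flat), high_freq_energy(checker))
```

In a real detector the retained coefficients would feed a learned encoder rather than a single energy value; the scalar here only makes the contrast between smooth and high-frequency content easy to see.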

Abstract

Recent advances in generative models have enabled visually convincing deepfake images, raising urgent concerns for trustworthy multimedia services in cloud-enabled platforms. Most existing detectors rely on uni-modal spatial or frequency cues and thus generalize poorly when the forgery distribution shifts across datasets or unseen generators. In this paper, we propose CMITF (Cross-Modal Integration of Textual-Frequency), a collaborative intelligence framework that synergizes large pre-trained models with lightweight task-specific modules for robust, transferable deepfake detection in cloud environments. CMITF leverages a large-scale vision-language model (CLIP) to extract rich textual embeddings from ClipCap-generated captions, while a compact frequency encoder captures DCT-based high-frequency representations. The two modalities are aligned via a correlation-driven Cross-modal Complementary Feature Alignment (CCFA) module, enabling efficient collaboration between the large semantic model and the small frequency-domain detector. A lightweight Feature Aggregation and Classification (FAC) module further applies spatial attention and global pooling to produce compact representations for resource-efficient cloud deployment. Extensive experiments on widely-used benchmarks demonstrate that CMITF consistently outperforms state-of-the-art methods, particularly in cross-dataset and cross-compression scenarios, showcasing the effectiveness of large-small model synergy for scalable and reliable content authentication in cloud-based media processing pipelines.
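
The abstract does not spell out how the correlation-driven alignment or the attention-plus-pooling step is computed, so the following is only a toy sketch of the general pattern: score each spatial location's frequency feature by its correlation with a text embedding, turn the scores into attention weights, and pool. Using cosine similarity as the correlation, softmax as the attention, and the name `align_and_pool` are assumptions for illustration, not the CCFA or FAC modules themselves.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def align_and_pool(text_emb, freq_feats):
    """Weight per-location frequency features by correlation with the text
    embedding, then pool them into one compact vector.

    text_emb:   list[float], one embedding (assumed same dimension as features)
    freq_feats: list of list[float], one feature vector per spatial location
    """
    scores = [cosine(text_emb, f) for f in freq_feats]
    weights = softmax(scores)  # attention over spatial locations
    dim = len(text_emb)
    pooled = [
        sum(w * f[d] for w, f in zip(weights, freq_feats)) for d in range(dim)
    ]
    return weights, pooled

# Three locations: aligned with, orthogonal to, and opposed to the text embedding.
weights, pooled = align_and_pool([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
print(weights, pooled)
```

The pooled vector has the same dimension regardless of how many locations are fed in, which is the property that makes this kind of aggregation attractive for a lightweight classifier head.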

Cite This Study

Chen et al. studied this question.

synapsesocial.com/papers/69db37964fe01fead37c58ef https://doi.org/10.1186/s13677-026-00898-2
