Recent advances in generative models have enabled visually convincing deepfake images, raising urgent concerns for trustworthy multimedia services on cloud-enabled platforms. Most existing detectors rely on unimodal spatial or frequency cues and therefore generalize poorly when the forgery distribution shifts across datasets or to unseen generators. In this paper, we propose CMITF (Cross-Modal Integration of Textual-Frequency), a collaborative intelligence framework that combines large pre-trained models with lightweight task-specific modules for robust, transferable deepfake detection in cloud environments. CMITF uses a large-scale vision-language model (CLIP) to extract textual embeddings from ClipCap-generated captions, while a compact frequency encoder captures DCT-based high-frequency representations. The two modalities are aligned by a correlation-driven Cross-modal Complementary Feature Alignment (CCFA) module, which lets the large semantic model and the small frequency-domain detector collaborate efficiently. A lightweight Feature Aggregation and Classification (FAC) module then applies spatial attention and global pooling to produce compact representations suited to resource-efficient cloud deployment. Extensive experiments on widely used benchmarks show that CMITF consistently outperforms state-of-the-art methods, particularly in cross-dataset and cross-compression scenarios, demonstrating the effectiveness of large-small model synergy for scalable and reliable content authentication in cloud-based media processing pipelines.
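To make the notion of a DCT-based high-frequency cue concrete, the following is a minimal sketch and not the CMITF frequency encoder, which is a learned module whose details are not given here. It computes an orthonormal 2-D DCT-II of a small grayscale block and reports the fraction of energy in coefficients above a frequency cutoff. The function names, the 8x8 block size, and the `u + v >= cutoff` rule are all illustrative assumptions.

```python
import math


def dct_1d(x):
    """Orthonormal DCT-II of a 1-D sequence (direct O(n^2) evaluation)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def dct_2d(block):
    """Separable 2-D DCT-II: transform rows, then columns."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    # cols is indexed [v][u]; transpose back to [u][v].
    return [list(r) for r in zip(*cols)]


def high_frequency_ratio(block, cutoff):
    """Fraction of DCT energy in coefficients with u + v >= cutoff.

    The cutoff rule is a hypothetical choice made for illustration only.
    """
    coeffs = dct_2d(block)
    total = sum(c * c for row in coeffs for c in row)
    if total == 0.0:
        return 0.0
    high = sum(
        coeffs[u][v] ** 2
        for u in range(len(coeffs))
        for v in range(len(coeffs[0]))
        if u + v >= cutoff
    )
    return high / total


# A constant block has only a DC component; a checkerboard has none.
flat = [[5.0] * 8 for _ in range(8)]
checker = [[1.0 if (i + j) % 2 == 0 else -1.0 for j in range(8)] for i in range(8)]
print(high_frequency_ratio(flat, 1))
print(high_frequency_ratio(checker, 1))
```

With a cutoff of 1, the constant block yields a ratio of essentially zero and the checkerboard a ratio of essentially one, which is the kind of contrast a frequency-domain detector can exploit. A practical encoder would operate on full images and learn its own filters rather than use a fixed energy ratio.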
Chen et al. studied this question.