Unified face attack detection (UAD), which uses vision–language models to detect physical and digital attacks simultaneously, remains challenging because it is difficult to identify liveness information and the biological cues forged by a wide variety of techniques. The difficulty stems from two primary aspects: (1) Spatial information alone is insufficient to capture comprehensive cues about both genuine and forged identities. (2) Textual prompts struggle to acquire low‐redundancy, complementary information that helps discriminate between live and fake cues. To address these issues, we propose a novel frequency‐aware, ensembled text prompt UAD model built upon the contrastive language‐image pretraining (CLIP) framework. Our model adaptively fuses spatial and frequency information to enhance the representation of genuine faces and of all attack types, and introduces an ensemble learning strategy to acquire low‐redundancy textual prompts. Specifically, the ensemble prompt module on the language branch generates general live and fake prompts from spatial and frequency information, guiding the model to learn a unified feature space that covers different attacks. This module also optimizes the redundancy and complementarity between prompts through an ensemble strategy and a dedicated information diversity constraint. Furthermore, we design a layer‐wise cross‐attention module in the vision branch to fuse frequency information from different layers. A redundancy minimization module is applied to the fused image features, compelling the spatial and frequency feature extraction modules to generate maximally exclusive features. Extensive experiments on multiple benchmarks demonstrate that our model achieves state‐of‐the‐art performance on most protocols of these datasets.
Jiang et al. studied this question.
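The abstract names three ingredients: frequency information, cross-attention fusion of spatial and frequency features, and a redundancy penalty between the two. The sketch below illustrates the general idea only and is not the authors' implementation. Everything in it is an assumption made for illustration: the log-amplitude FFT spectrum as the frequency representation, raw patch tokens in place of CLIP layer features, attention without learned projections, and a squared cross-correlation penalty standing in for the redundancy minimization module. All function names are hypothetical.

```python
import numpy as np

def log_amplitude_spectrum(image):
    """Centered log-amplitude spectrum of a 2D grayscale image."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    return np.log1p(np.abs(spectrum))

def patch_tokens(feature_map, patch):
    """Split an (H, W) map into non-overlapping patch tokens of length patch*patch."""
    h, w = feature_map.shape
    rows, cols = h // patch, w // patch
    cropped = feature_map[:rows * patch, :cols * patch]
    blocks = cropped.reshape(rows, patch, cols, patch).swapaxes(1, 2)
    return blocks.reshape(rows * cols, patch * patch)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Single-head cross-attention without learned projections (assumption)."""
    scale = np.sqrt(queries.shape[-1])
    weights = softmax(queries @ keys_values.T / scale, axis=-1)
    return weights @ keys_values, weights

def redundancy_penalty(a, b, eps=1e-8):
    """Mean squared cross-correlation between the feature dimensions of a and b.

    A stand-in (assumption) for a redundancy minimization objective:
    lower values mean the two feature sets share less linear information.
    """
    a = (a - a.mean(axis=0)) / (a.std(axis=0) + eps)
    b = (b - b.mean(axis=0)) / (b.std(axis=0) + eps)
    cross_corr = a.T @ b / a.shape[0]
    return float(np.mean(cross_corr ** 2))

# Toy input: a random 32x32 "image" in place of a real face crop.
rng = np.random.default_rng(0)
image = rng.random((32, 32))

spatial = patch_tokens(image, patch=8)                             # (16, 64)
frequency = patch_tokens(log_amplitude_spectrum(image), patch=8)   # (16, 64)

# Spatial tokens query the frequency tokens; a residual sum fuses the two.
attended, weights = cross_attention(spatial, frequency)
fused = spatial + attended

penalty = redundancy_penalty(spatial, frequency)
```

In a trainable model the queries, keys, and values would pass through learned projections at several encoder layers, and the penalty would be added to the classification loss so that the spatial and frequency extractors are pushed toward complementary features.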