What does this research mean for the field?

The proposed frequency-aware and ensembled text prompt UAD model improves detection of both physical and digital face attacks by strengthening the representation of genuine faces and of the various attack types. Novelty: novel finding. Consensus alignment: neutral.

What question did this study set out to answer?

The aim is to enhance unified face attack detection by effectively identifying genuine and forged identities using multimodal information.

March 12, 2026 | Open Access

Frequency‐Aware Cue Fusion and Ensembled Prompt Learning for Unified Face Attack Detection With Vision–Language Model

Key points

Developed a frequency-aware and ensembled text prompt model based on the CLIP framework.
Fused spatial and frequency information to improve representation of faces and attacks.
Introduced ensemble learning to generate low-redundancy textual prompts for better discrimination.
Implemented layer-wise cross-attention to combine frequency information from different layers.
Employed redundancy minimization on features to create exclusive representations.
Achieved state-of-the-art performance on most protocols of multiple face attack detection benchmarks.
Demonstrated improved accuracy in discriminating between live and fake identities.
Enhanced feature representations reduced redundancy and improved model efficiency.
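
The spatial/frequency fusion and redundancy minimization listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the naive 1-D DFT, the concatenation-based `fuse`, and the squared-cosine `redundancy_penalty` are simple stand-ins chosen for illustration, and the pixel values are hypothetical. The paper itself describes adaptive fusion and layer-wise cross-attention, which this sketch does not reproduce.

```python
import cmath
import math


def amplitude_spectrum(signal):
    """Return the DFT magnitude of a 1-D signal (naive O(n^2) transform)."""
    n = len(signal)
    spectrum = []
    for k in range(n):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        spectrum.append(abs(coeff))
    return spectrum


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def redundancy_penalty(spatial, frequency):
    """Assumed stand-in loss: squared cosine similarity, 0 when orthogonal."""
    return cosine(spatial, frequency) ** 2


def fuse(spatial, frequency):
    """Assumed stand-in fusion: plain concatenation of the two cue vectors."""
    return list(spatial) + list(frequency)


# Hypothetical row of pixel intensities with a strong alternating pattern.
row = [0.2, 0.8, 0.2, 0.8]
freq = amplitude_spectrum(row)           # frequency-domain cue
fused = fuse(row, freq)                  # joint spatial + frequency feature
penalty = redundancy_penalty(row, freq)  # lower means less overlap
```

Minimizing a penalty of this kind during training would push the two feature extractors toward non-overlapping information, which is the intent the key points attribute to the redundancy minimization module.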

Abstract

Unified face attack detection (UAD) systems, which utilize vision–language models to simultaneously discriminate between physical and digital attacks, remain challenging due to the difficulty in effectively identifying live information and the biological cues forged by a variety of distinct technologies. The difficulty stems from two primary aspects: (1) Single spatial information is insufficient for capturing comprehensive cues regarding both genuine and forged identities. (2) Textual prompts struggle to acquire low‐redundancy and complementary information to facilitate the discrimination between live and fake cues. To address these issues, we propose a novel frequency‐aware and ensembled text prompt UAD model built upon the contrastive language‐image pretraining (CLIP) framework. Our model adaptively fuses spatial and frequency information to enhance the representation of genuine faces and all types of attacks, while simultaneously introducing an ensemble learning strategy to acquire low‐redundancy textual prompts. Specifically, the ensemble prompt module generates general live and fake prompts from spatial and frequency information on the language branch, thus guiding the model to learn a unified feature space to deal with different attacks. Meanwhile, this module optimizes the redundancy and complementarity between prompts through an ensemble strategy and a designed information diversity constraint. Furthermore, we design a layer‐wise cross‐attention module in the vision branch to fuse frequency information from different layers. A designed redundancy minimization module is employed on the fused image features, thereby compelling the spatial and frequency feature extraction modules to generate maximally exclusive features. Extensive experiments on multiple benchmarks demonstrate that our model achieves state‐of‐the‐art performance across most protocols of the datasets.
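
As a rough illustration of how ensembled live and fake prompts can drive a CLIP-style decision, the sketch below averages several prompt embeddings per class and scores an image feature by temperature-scaled cosine similarity. Everything in it is assumed for illustration: the three-dimensional embeddings are made-up numbers rather than CLIP outputs, the temperature of 0.01 is an arbitrary choice, and `mean_pairwise_similarity` is only one simple way to quantify prompt redundancy, not the information diversity constraint designed in the paper.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ensemble(prompt_embeddings):
    """Average unit-normalized prompt embeddings, then renormalize."""
    unit = [normalize(p) for p in prompt_embeddings]
    dim = len(unit[0])
    mean = [sum(p[i] for p in unit) / len(unit) for i in range(dim)]
    return normalize(mean)


def classify(image_feature, class_prompts, temperature=0.01):
    """Softmax over temperature-scaled cosine similarities to each class."""
    image = normalize(image_feature)
    logits = {label: dot(image, ensemble(prompts)) / temperature
              for label, prompts in class_prompts.items()}
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(z - peak) for label, z in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def mean_pairwise_similarity(prompt_embeddings):
    """Average cosine similarity over all prompt pairs (lower = more diverse)."""
    unit = [normalize(p) for p in prompt_embeddings]
    pairs = [(i, j) for i in range(len(unit)) for j in range(i + 1, len(unit))]
    return sum(dot(unit[i], unit[j]) for i, j in pairs) / len(pairs)


# Hypothetical prompt embeddings; a real system would take these from CLIP.
PROMPTS = {
    "live": [[1.0, 0.1, 0.0], [0.9, 0.0, 0.2]],
    "fake": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]],
}
probs = classify([0.8, 0.2, 0.1], PROMPTS)
live_redundancy = mean_pairwise_similarity(PROMPTS["live"])
```

A diversity term could penalize a high `live_redundancy` so that the prompts in each ensemble carry complementary rather than repeated information.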

Cite This Study

Jiang et al. (2026) studied this question.

synapsesocial.com/papers/69b2581996eeacc4fcec770c https://doi.org/10.1049/bme2/1941529
