Interest in automated biometric identification applications is growing. Such applications require a system that accurately identifies a specific group of people while also detecting individuals who do not belong to that group. For face identification models based on Deep Learning (DL), this setting is known as Open-Set Recognition (OSR), and it is the focus of this work. OSR poses a substantial challenge because the system must reliably flag unknown individuals who were not part of its training data. Because accuracy is crucial in this setting, choosing the right model for each scenario becomes key, and this is the motivation for our work. We present a rigorous comparative analysis of the precision of some of the most widely used face identification models today, comparing several Convolutional Neural Network (CNN) models with a Vision Transformer (ViT) model. All models were pre-trained on the same large dataset and evaluated in an OSR scenario. The results show that ViT achieves the highest precision, outperforming the CNN baselines and generalizing better to unknown identities. These findings support recent evidence that ViT is a promising alternative to CNNs for this type of application.
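To make the open-set setting concrete, the following minimal sketch shows one common way an identification system can accept enrolled identities and reject unknown ones: compare a probe embedding against a gallery of enrolled embeddings and return "unknown" when the best similarity falls below a threshold. This is an illustration only, not the method evaluated in this work. The function names, the three-dimensional toy embeddings, and the threshold of 0.5 are all assumptions; in practice the embeddings would come from a CNN or ViT backbone and the threshold would be tuned on validation data.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify_open_set(probe, gallery, threshold=0.5):
    """Return (identity, score) for the best gallery match.

    The identity is None when the best score is below `threshold`,
    meaning the probe is treated as an unknown individual.
    The threshold value is a hypothetical placeholder.
    """
    best_name, best_score = None, -1.0
    for name, reference in gallery.items():
        score = cosine_similarity(probe, reference)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score


# Toy gallery of enrolled identities (hypothetical embeddings).
gallery = {
    "alice": [1.0, 0.0, 0.0],
    "bob": [0.0, 1.0, 0.0],
}

known_probe = [0.9, 0.1, 0.0]    # close to the "alice" embedding
unknown_probe = [0.0, 0.0, 1.0]  # orthogonal to every enrolled embedding

known_result = identify_open_set(known_probe, gallery)
unknown_result = identify_open_set(unknown_probe, gallery)
```

In this sketch, `known_result` names an enrolled identity, while `unknown_result` carries `None` as its identity because no gallery entry is similar enough. The choice of threshold trades off false acceptances of unknown individuals against false rejections of enrolled ones.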
Galván et al. studied this question.