This work explores how deep learning models of varying parameter counts can be applied effectively to detect personal data in unstructured text using Named Entity Recognition (NER) techniques. We evaluate a range of language models (LMs): DistilBERT-base-uncased, DistilBERT-base-cased, BERT-base-uncased, BERT-base-cased, BERT-large-uncased, BERT-large-cased, ModernBERT-base, ModernBERT-large, nomic-BERT-2048, RoBERTa-base, DistilRoBERTa-base, RoBERTa-large, DeBERTa-v3-xsmall, DeBERTa-v3-small, and DeBERTa-v3-base. Performance is measured by accuracy, precision, recall, and F1-score. Our experiments show that, on the specific Personally Identifiable Information (PII) dataset studied, some Small Language Models (SLMs) perform on par with some of their corresponding Large Language Models (LLMs), thus enhancing personal data detection, which is of paramount importance in financial applications. Moreover, we propose a novel architecture based on an optimized transformer fine-tuning strategy to improve PII recognition across diverse contexts, and we conduct an extensive comparative analysis of this architecture against all relevant existing approaches reported in the literature. This evaluation, performed on the AI4Privacy PII 43K dataset, encompasses every publicly available work we identified and provides a thorough benchmarking of our methods within the current research field. The results highlight both the strengths and limitations of existing solutions and demonstrate the effectiveness of SLMs in addressing the challenges of privacy-preserving information extraction.
Psarra et al. (Mon,) studied this question.
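As an illustration of the four performance indices named above, the following minimal sketch computes micro-averaged accuracy, precision, recall, and F1-score over token-level NER labels. It is an assumption that scoring is done per token with `O` as the non-entity class; the abstract does not state whether the reported scores are token-level or entity-level. The function name `token_metrics` and the labels `NAME` and `EMAIL` are hypothetical and chosen only for the example.

```python
from typing import Dict, Sequence


def token_metrics(gold: Sequence[str], pred: Sequence[str],
                  outside: str = "O") -> Dict[str, float]:
    """Micro-averaged token-level scores; `outside` marks non-PII tokens.

    Assumption: one label per token, gold and pred aligned position by position.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    pairs = list(zip(gold, pred))
    # True positive: a PII token predicted with exactly the right label.
    tp = sum(1 for g, p in pairs if g == p and g != outside)
    # False positive: a PII label predicted where the gold label differs.
    fp = sum(1 for g, p in pairs if p != outside and g != p)
    # False negative: a gold PII token that was missed or mislabeled.
    fn = sum(1 for g, p in pairs if g != outside and g != p)
    correct = sum(1 for g, p in pairs if g == p)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical six-token example (labels are illustrative only).
gold_labels = ["O", "NAME", "NAME", "O", "EMAIL", "O"]
pred_labels = ["O", "NAME", "O", "O", "EMAIL", "EMAIL"]
scores = token_metrics(gold_labels, pred_labels)
```

Because most tokens in running text are non-PII, accuracy alone can look high even when many entities are missed, which is one reason precision, recall, and F1-score are usually reported alongside it.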