What question did this study set out to answer?

The study aims to evaluate the effectiveness of various machine learning methods in detecting prompt injection attacks on large language models.

June 17, 2026 | Open Access

Comparative evaluation of machine learning methods for protecting LLMs from prompt injection attacks

Key Points

Evaluated four classes of defense approaches: traditional ML classifiers, fine-tuned encoder-based transformers, specialized detector models, and general-purpose LLMs.
Used a benchmark of over 300k labeled prompts curated from multiple open-source corpora.
Compared detection effectiveness, accuracy, and computational requirements of each approach.
Fine-tuned transformer models achieved the highest detection rate on balanced benchmarks.
Traditional ML-based classifiers showed reasonably strong accuracy while needing less computation.
Specialized detector models had lower recall compared to general-purpose classifiers.

Abstract

Prompt Injection attacks are a significant threat to the security of Large Language Models (LLMs), allowing adversaries to manipulate model outputs and bypass guardrails and restrictions. This study explores the effectiveness of selected Machine Learning-based approaches in detecting such attacks, comparing four classes of general-purpose and specialized defense approaches. Our empirical evaluation characterizes and discusses each method, showing that detection effectiveness depends strongly on detection strategies and deployment trade-offs in inline defenses. Using a large benchmark of over 300k labeled prompts, selected following a comparison with three additional open-source corpora, we evaluate (1) traditional ML-based classifiers using embeddings, (2) independently fine-tuned encoder-based transformer models (DistilBERT, RoBERTa, DeBERTa), (3) specialized transformer-based detection models pre-trained for Prompt Injection, and (4) general-purpose LLMs configured as classifiers. Our experiments show that fine-tuned models achieve the highest detection rate on balanced benchmarks, traditional ML-based classifiers provide reasonably strong accuracy while requiring less computation, specialized detector models obtain lower recall, and general-purpose LLM-based classifiers can provide competitive performance, depending on the model used. Our research contributes to improving the robustness of AI-driven security systems against Prompt Injection by presenting a comparative study that can help in selecting and deploying ML-based defenses against Prompt Injection attacks.
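
The abstract reports results in terms of detection rate, recall, and accuracy. As a minimal sketch of how these measures can diverge for a binary injection detector, the Python below scores predictions against labels, assuming 1 marks an injection and 0 a benign prompt. The names `score_detector` and `DetectionMetrics` and the toy data are illustrative assumptions; they are not taken from the paper and do not reproduce its results.

```python
from dataclasses import dataclass


@dataclass
class DetectionMetrics:
    accuracy: float
    precision: float
    recall: float


def score_detector(labels, predictions):
    """Score binary predictions (1 = injection, 0 = benign) against labels."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")

    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # attacks caught
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # benign passed
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # benign blocked
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # attacks missed

    total = len(pairs)
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return DetectionMetrics(accuracy, precision, recall)


# Made-up toy labels for illustration only, not data from the study.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_predictions = [1, 1, 0, 0, 0, 0, 0, 0]
metrics = score_detector(toy_labels, toy_predictions)
print(metrics)
```

In this toy case the detector never flags a benign prompt, so precision is perfect and accuracy looks reasonable, yet it misses half of the attacks. This is why a detector with good accuracy can still have low recall, and why the two are worth reporting separately.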
