What question did this study set out to answer?

The aim is to characterize jailbreak vulnerabilities in large language models through a novel domain-based taxonomy.

April 3, 2026 | Open Access

A domain-based taxonomy of jailbreak vulnerabilities in large language models

Key Points

Introduced a taxonomy of jailbreak attacks framed by the training domains of LLMs.
Classified attacks based on underlying model deficiencies rather than prompt methods.
Conducted an extensive review of the LLM jailbreaking literature to support the proposed taxonomy.
Identified four categories of jailbreak attacks: mismatched generalization, competing objectives, adversarial robustness, and mixed attacks.
Provided insights into the limitations of current approaches for mitigating jailbreak vulnerabilities.
Analyzed the gaps in alignment that lead to vulnerabilities in LLM outputs.

Abstract

The study of large language models (LLMs) is a key area in open-world machine learning. Although LLMs demonstrate remarkable natural language processing capabilities, they also face several challenges, including consistency issues, hallucinations, and jailbreak vulnerabilities. Jailbreaking refers to the crafting of prompts that bypass alignment safeguards, leading to unsafe outputs that compromise the integrity of LLMs. This work focuses on jailbreak vulnerabilities and introduces a novel taxonomy of jailbreak attacks grounded in the training domains of LLMs. It characterizes alignment failures as arising from gaps in generalization, objectives, and robustness. Our primary contribution is a perspective on jailbreaking, framed through the different linguistic domains that emerge during LLM training and alignment. This viewpoint highlights the limitations of existing approaches and enables us to classify jailbreak attacks in terms of the underlying model deficiencies they exploit. Unlike conventional classifications that categorize attacks by prompt construction method (e.g., prompt templating), our approach provides a deeper understanding of LLM behavior. We introduce a taxonomy with four categories (mismatched generalization, competing objectives, adversarial robustness, and mixed attacks), offering insights into the fundamental nature of jailbreak vulnerabilities. Finally, we present key lessons derived from this taxonomic study.

• We analyze why aligned large language models (LLMs) are vulnerable to jailbreaking attacks. These attacks allow a user to generate answers that violate the policies of an LLM company. The analysis takes a domain perspective, in which we distinguish different training domain regions. It is based on, and extends, the Jailbroken hypothesis, from a paper published at the NeurIPS conference.
• Based on this analysis, we propose a taxonomy to classify LLM jailbreaking attacks. Three main types of attacks are distinguished, namely mismatched generalization, competing objectives, and adversarial robustness. We further categorize these types of attacks into subgroups.
• We extensively review the LLM jailbreaking literature to support our taxonomy, classifying the attacks it describes into the categories we propose.
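As an illustration only, the four categories named above can be sketched as a small data structure. This is not the authors' implementation: the names `Deficiency`, `MIXED`, and `categorize` are hypothetical, and the sketch assumes that a "mixed attack" is one that exploits more than one of the three main deficiencies, which the abstract does not state explicitly.

```python
from enum import Enum


class Deficiency(Enum):
    """The three main attack types named in the abstract."""
    MISMATCHED_GENERALIZATION = "mismatched generalization"
    COMPETING_OBJECTIVES = "competing objectives"
    ADVERSARIAL_ROBUSTNESS = "adversarial robustness"


# Fourth category from the abstract. Assumption: it covers attacks
# that exploit more than one deficiency at once.
MIXED = "mixed attacks"


def categorize(exploited: set) -> str:
    """Map the set of deficiencies an attack exploits to a category name."""
    if not exploited:
        raise ValueError("an attack must exploit at least one deficiency")
    if len(exploited) == 1:
        # Exactly one deficiency: the category is that deficiency.
        return next(iter(exploited)).value
    # Two or more deficiencies: treated here as a mixed attack (assumption).
    return MIXED
```

Under these assumptions, an attack is labeled by the model deficiency it exploits rather than by how its prompt is constructed, which is the distinction the taxonomy draws against prompt-based classifications.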
