What question did this study set out to answer?

This research aims to determine whether large language models can internally recognize harmful content without external guardrails.

March 21, 2026 | Open Access

Safety Beyond the Interface: Detecting Harm via Latent LLM States

Key Points

Extracted activations from LLaMA-3.1-8B
Trained lightweight MLP classifiers with 12.6 million parameters
Evaluated using datasets: WildJailbreak, Beavertails, and AEGIS 2.0
Achieved F1 scores of 99%, 83%, and 84% on the respective datasets
Probes performed competitively compared to larger guard models
Reduced latency and computational costs compared to traditional methods
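For readers less familiar with the metric quoted above: F1 is the harmonic mean of precision and recall. The following is a minimal sketch of a binary F1 computation, assuming label 1 means "harmful". The labels are invented for illustration and are not data from the study, and the summary does not say which F1 variant (binary, macro, or weighted) the authors report.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the `positive` class (assumed here: 1 = harmful)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both 0 (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, made up for this example only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
score = f1_score(y_true, y_pred)  # 3 TP, 1 FP, 1 FN
```

With one false positive and one false negative against three true positives, precision and recall are both 0.75, so the F1 is 0.75 as well.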

Abstract

External guardrails for LLM safety add latency and compute overhead while remaining blind to the model's internal reasoning. We ask: does the model already know when content is harmful? We extract activations from LLaMA-3.1-8B and train lightweight MLP classifier probes (12.6M parameters) to detect harmful prompts. Evaluated on WildJailbreak, Beavertails, and AEGIS 2.0, our probes achieve F1 scores of 99%, 83%, and 84%, respectively, competitive with guard models more than 1000× larger, while cutting latency and compute costs.
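To make the idea of an MLP probe on activations concrete, here is a rough, dependency-free sketch. It is not the authors' implementation: the abstract gives neither the probe architecture nor the layer the activations come from. The one known quantity is that LLaMA-3.1-8B has a hidden size of 4096. The probe width of 3072 is a hypothetical choice that happens to land near the stated 12.6M parameters, and the names `mlp_param_count`, `init_mlp`, and `probe_forward` are illustrative.

```python
import math
import random


def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected MLP with the given layer sizes."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def init_mlp(layer_sizes, seed=0):
    """Randomly initialise (weights, biases) for each layer."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def probe_forward(layers, activation):
    """Map one activation vector to a probability of 'harmful'."""
    h = activation
    for i, (weights, biases) in enumerate(layers):
        h = [sum(w * x for w, x in zip(row, h)) + b
             for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            h = [max(0.0, v) for v in h]  # ReLU on hidden layers only
    return 1.0 / (1.0 + math.exp(-h[0]))  # sigmoid on the single output logit


# 4096 = LLaMA-3.1-8B hidden size; 3072 is a HYPOTHETICAL probe width.
HYPOTHETICAL_SIZES = [4096, 3072, 1]
n_params = mlp_param_count(HYPOTHETICAL_SIZES)  # about 12.6 million

# Tiny stand-in probe and a fake activation so the sketch runs instantly.
tiny_probe = init_mlp([8, 4, 1], seed=0)
fake_activation = [0.1 * i for i in range(8)]
p_harmful = probe_forward(tiny_probe, fake_activation)
```

In practice the input vector would be a hidden state read from the model during its normal forward pass, and the probe would be trained on labelled prompts. Because the probe reuses computation the model already performs, a design like this could plausibly account for the latency savings the abstract describes.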

Cite This Study

Khatri et al. (Mon,) studied this question.

synapsesocial.com/papers/69be36d46e48c4981c6760b6 https://doi.org/10.5281/zenodo.19077951
