External guardrails for LLM safety add latency and compute overhead while remaining blind to the model's internal reasoning. We ask: does the model already know when content is harmful? We extract activations from LLaMA-3.1-8B and train lightweight MLP classifier probes (12.6M parameters) to detect harmful prompts. Evaluated on WildJailbreak, BeaverTails, and AEGIS 2.0, our probes achieve F1 scores of 99%, 83%, and 84%, respectively. These results are competitive with guard models more than 1000× larger, while cutting latency and compute costs.
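As a rough illustration of the probing idea described above, the following minimal sketch trains a small MLP classifier on fixed-size vectors and scores it with F1. It is not the authors' implementation: the vectors are synthetic stand-ins for LLaMA-3.1-8B activations, the dimensions (8 inputs, 16 hidden units) are far smaller than the 12.6M-parameter probes, and every name (`make_synthetic_activations`, `MLPProbe`, `f1_score`, `run_demo`) is hypothetical. It uses only the Python standard library.

```python
import math
import random


def make_synthetic_activations(n, dim, rng):
    """Hypothetical stand-in for extracted hidden states (an assumption).

    Label 1 ("harmful") vectors are centered at +1 on every axis and
    label 0 vectors at -1, each with unit Gaussian noise.
    """
    samples = []
    for i in range(n):
        label = i % 2
        center = 1.0 if label == 1 else -1.0
        samples.append(([rng.gauss(center, 1.0) for _ in range(dim)], label))
    return samples


class MLPProbe:
    """One-hidden-layer ReLU MLP with a sigmoid output, trained by SGD."""

    def __init__(self, dim, hidden, rng):
        self.w1 = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [rng.gauss(0.0, 0.1) for _ in range(hidden)]
        self.b2 = 0.0

    def forward(self, x):
        h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        z = sum(w * hj for w, hj in zip(self.w2, h)) + self.b2
        z = max(-30.0, min(30.0, z))  # clamp so exp() cannot overflow
        return h, 1.0 / (1.0 + math.exp(-z))

    def train_step(self, x, y, lr):
        h, p = self.forward(x)
        grad_z = p - y  # gradient of binary cross-entropy w.r.t. the logit
        for j, hj in enumerate(h):
            # Backpropagate through the ReLU before updating w2[j].
            grad_h = grad_z * self.w2[j] if hj > 0.0 else 0.0
            self.w2[j] -= lr * grad_z * hj
            if grad_h != 0.0:
                self.w1[j] = [w - lr * grad_h * xi for w, xi in zip(self.w1[j], x)]
                self.b1[j] -= lr * grad_h
        self.b2 -= lr * grad_z

    def predict(self, x):
        return 1 if self.forward(x)[1] >= 0.5 else 0


def f1_score(labels, preds):
    """F1 for the positive class (label 1); returns 0.0 with no true positives."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def run_demo(seed=0):
    """Train on synthetic vectors and return held-out F1."""
    rng = random.Random(seed)
    train = make_synthetic_activations(200, 8, rng)
    held_out = make_synthetic_activations(100, 8, rng)
    probe = MLPProbe(dim=8, hidden=16, rng=rng)
    for _ in range(10):  # epochs
        rng.shuffle(train)
        for x, y in train:
            probe.train_step(x, y, lr=0.05)
    labels = [y for _, y in held_out]
    preds = [probe.predict(x) for x, _ in held_out]
    return f1_score(labels, preds)


if __name__ == "__main__":
    print(f"held-out F1 on synthetic data: {run_demo():.3f}")
```

In a real setting, the synthetic vectors would be replaced by hidden states captured from the language model for each prompt. Which layer and token position to use is not stated in the abstract and would need to be chosen separately.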
Alizishaan Khatri
Chiquita Prabhu
Omkar Neogi
GTx (United States)
Khatri et al. studied this question.
www.synapsesocial.com/papers/69be36d46e48c4981c6760b6 — DOI: https://doi.org/10.5281/zenodo.19077951