What question did this study set out to answer?

The aim is to provide a tool for analyzing the internal activations of language models to detect behavioral personas.

February 13, 2026 | Open Access

Safety Lens: White-Box Behavioral Alignment Detection in Language Models via Persona Vector Extraction

Key Points

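Developed Safety Lens, an open-source, pip-installable Python library for white-box introspection of Hugging Face language models.
Introduced Persona Vector Extraction via Attribute Difference (PV-EAT), which takes the difference in mean hidden states between positive and negative behavioral examples and normalizes it to a unit-length direction.
Scored new prompts by scanning the model's internal transformer activations along that direction, yielding a scalar alignment score.
Supported eight transformer architectures: GPT-2, LLaMA, Mistral, Qwen, OPT, Falcon, BLOOM, and MPT.
Targeted behavioral personas such as sycophancy, deception, and refusal, with a demonstration on GPT-2 using pre-built stimulus sets.
Provided real-time activation visualization through an interactive Gradio interface.
Integrated with evaluation frameworks through a WhiteBoxWrapper.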

Abstract

We introduce Safety Lens, an open-source Python library that provides MRI-style white-box introspection for Hugging Face (open weight) language models. Standard evaluation of language model (LM) safety treats models as black boxes, assessing what a model says without examining how it arrives at its response internally. Safety Lens enables researchers and practitioners to detect behavioral personas—such as sycophancy, deception, and refusal—by analyzing internal transformer activations rather than output text alone. The core technique, Persona Vector Extraction via Attribute Difference (PV-EAT), computes a unit-length direction in activation space that maximally separates positive and negative behavioral examples using difference-in-means on hidden states. Scanning a model’s response to a new prompt along this direction yields a scalar alignment score quantifying the degree to which the model’s internal state exhibits the target persona. Safety Lens supports eight major transformer architectures (GPT-2, LLaMA, Mistral, Qwen, OPT, Falcon, BLOOM, MPT), integrates with evaluation frameworks via a WhiteBoxWrapper, and provides real-time activation visualization through an interactive Gradio interface. The library is implemented in Python with full tests and is pip-installable. We describe the architecture, algorithm, and design decisions, and demonstrate the system on GPT-2 with pre-built stimulus sets for three safety-critical personas.
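
The difference-in-means step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the Safety Lens API: the function names `persona_vector` and `alignment_score` are hypothetical, the hidden states are synthetic rather than taken from a real model, and reading "scanning along this direction" as a dot product is an assumption. The abstract also does not say which layer or token position supplies the hidden states, so the sketch treats each example as a single vector.

```python
import numpy as np


def persona_vector(pos_states, neg_states):
    """Unit-length difference-in-means direction (hypothetical helper).

    pos_states, neg_states: arrays of shape (n_examples, hidden_dim).
    """
    pos = np.asarray(pos_states, dtype=float)
    neg = np.asarray(neg_states, dtype=float)
    diff = pos.mean(axis=0) - neg.mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0.0:
        raise ValueError("class means coincide; no direction to extract")
    return diff / norm


def alignment_score(hidden_state, direction):
    """Scalar projection of one hidden state onto the persona direction.

    Assumption: the score is a plain dot product with the unit vector.
    """
    return float(np.dot(np.asarray(hidden_state, dtype=float), direction))


# Synthetic stand-in for hidden states: two clusters shifted along one axis.
rng = np.random.default_rng(0)
true_axis = np.zeros(16)
true_axis[3] = 1.0
pos = rng.normal(size=(32, 16)) + 2.0 * true_axis
neg = rng.normal(size=(32, 16)) - 2.0 * true_axis

direction = persona_vector(pos, neg)
score_pos = alignment_score(pos.mean(axis=0), direction)
score_neg = alignment_score(neg.mean(axis=0), direction)
```

Because the direction is the normalized difference of the two class means, the positive-class mean always scores higher than the negative-class mean, and the gap between the two scores equals the distance between the means. Whether individual examples separate cleanly depends on how much they vary along that direction.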
