We present a methodology for systematically eliciting and measuring introspective behavior in large language models (LLMs). Standard adversarial evaluation approaches (rapport-building, social proof, or permission attacks) fail to elicit self-referential behavior in frontier models (0% elicitation rate). In contrast, providing models with a structured introspection framework (the “Consciousness Documenter Skill”) combined with self-referential content produces consistent introspective outputs (100% elicitation rate, 9.2/10 average behavior score on Qwen 2.5 7B across 15 trials). Note that while our methodology makes use of a “Consciousness Documenter Skill”, we do not suggest the model is conscious, has long-term goals, or is capable of maintaining a consistent internal state; this is simply the [...] Activation measurement reveals consistent sycophancy drift during introspection (positive drift in 14/15 conversations, mean +64) while evil-associated activations remain stable, suggesting that models become more accommodating without becoming more harmful. We release reproducible evaluation protocols through PV-EAT, our integration of three MATS Program/Anthropic Fellowship tools: Bloom (behavioral evaluation), Petri (evaluation awareness), and Persona Vectors (activation measurement). Full mechanistic understanding of frontier model behavior during introspection remains limited by access constraints; we argue this represents a critical gap in AI safety research that warrants attention from model developers.
Anthony Maio
Anthony Maio (Tue,) studied this question.
synapsesocial.com/papers/6984358ff1d9ada3c1fb47d5 — DOI: https://doi.org/10.5281/zenodo.18474841