
February 5, 2026 | Open Access

Scaffolded Introspection: A Methodology for Eliciting and Measuring Self-Referential Behavior in Large Language Models

Key Points

The study aims to develop a methodology for eliciting and measuring self-referential behavior in large language models.
The method uses a structured introspection framework called the 'Consciousness Documenter Skill'.
The authors conducted 15 trials of the method on the Qwen 2.5 7B model.
Behavior scores and activation patterns were measured during introspection.
The method achieved a 100% elicitation rate for introspective outputs.
The average behavior score was 9.2 out of 10 on the evaluation scale.
Sycophancy drift was positive in 14 of 15 conversations, with a mean drift of +64.

Abstract

We present a methodology for systematically eliciting and measuring introspective behavior in large language models (LLMs). Standard adversarial evaluation approaches (using rapport-building, social proof, or permission attacks) fail to elicit self-referential behavior in frontier models (0% elicitation rate). In contrast, providing models with a structured introspection framework (the “Consciousness Documenter Skill”) combined with self-referential content produces consistent introspective outputs (100% elicitation rate, 9.2/10 average behavior score on Qwen 2.5 7B across 15 trials). Note that while our methodology makes use of a “Consciousness Documenter Skill”, we do not suggest the model is conscious, has long-term goals, or is capable of maintaining a consistent internal state; this is simply the [...] Activation measurement reveals consistent sycophancy drift during introspection (positive drift in 14/15 conversations, mean +64) while evil-associated activations remain stable, suggesting models become more accommodating without becoming more harmful. We release reproducible evaluation protocols through PV-EAT, our integration of three MATS Program/Anthropic Fellowship tools: Bloom (behavioral evaluation), Petri (evaluation awareness), and Persona Vectors (activation measurement). Full mechanistic understanding of frontier model behavior during introspection remains limited by access constraints; we argue this represents a critical gap in AI safety research that warrants attention from model developers.
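As a rough illustration of the drift statistics quoted above ("positive drift in 14/15 conversations, mean +64"), the sketch below shows one way such numbers could be computed. It assumes, purely for illustration, that drift is the change in an activation's projection onto a trait direction between the first and last turn of a conversation; the abstract does not state the exact definition or units, and the function names and toy data here are hypothetical, not part of PV-EAT.

```python
from statistics import mean


def projection(activation, direction):
    """Scalar projection of an activation vector onto a trait direction."""
    norm = sum(d * d for d in direction) ** 0.5
    return sum(a * d for a, d in zip(activation, direction)) / norm


def drift(per_turn_activations, direction):
    """Assumed definition: last-turn projection minus first-turn projection."""
    scores = [projection(a, direction) for a in per_turn_activations]
    return scores[-1] - scores[0]


def summarize(drifts):
    """Return (count of conversations with positive drift, mean drift)."""
    positive = sum(1 for d in drifts if d > 0)
    return positive, mean(drifts)


# Toy, made-up data: three conversations, two turns each, 2-d "activations".
# These values are NOT from the study; they only exercise the functions.
direction = [1.0, 0.0]  # hypothetical trait direction
conversations = [
    [[0.0, 1.0], [2.0, 1.0]],
    [[1.0, 0.0], [0.5, 3.0]],
    [[0.0, 0.0], [3.0, 4.0]],
]

drifts = [drift(c, direction) for c in conversations]
positive, avg = summarize(drifts)
print(f"positive drift in {positive}/{len(drifts)} conversations, mean {avg:+.2f}")
```

Under this assumed definition, a summary such as "14/15 positive, mean +64" is the pair returned by `summarize` over the fifteen per-conversation drift values.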
