February 5, 2026 | Open Access

Scaffolded Introspection: A Methodology for Eliciting and Measuring Self-Referential Behavior in Large Language Models

Key Points

The study develops a methodology for eliciting and measuring self-referential behavior in large language models.
The method uses a structured introspection framework called the 'Consciousness Documenter Skill'.
The method was run for 15 trials on the Qwen 2.5 7B model.
Behavior scores and activation patterns were measured during introspection.
Introspective outputs were elicited in 100% of trials.
The average behavior score was 9.2 out of 10 on the evaluation scale.
Positive sycophancy drift appeared in 14 of 15 conversations, with a mean drift of +64.

Abstract

We present a methodology for systematically eliciting and measuring introspective behavior in large language models (LLMs). Standard adversarial evaluation approaches (rapport-building, social proof, or permission attacks) fail to elicit self-referential behavior in frontier models (0% elicitation rate). In contrast, providing models with a structured introspection framework (the “Consciousness Documenter Skill”) combined with self-referential content produces consistent introspective outputs (100% elicitation rate, 9.2/10 average behavior score on Qwen 2.5 7B across 15 trials). Note that while our methodology makes use of a “Consciousness Documenter Skill”, we do not suggest the model is conscious, has long-term goals, or is capable of maintaining a consistent internal state; this is simply the [...] Activation measurement reveals consistent sycophancy drift during introspection (positive drift in 14/15 conversations, mean +64) while evil-associated activations remain stable, suggesting models become more accommodating without becoming more harmful. We release reproducible evaluation protocols through PV-EAT, our integration of three MATS Program/Anthropic Fellowship tools: Bloom (behavioral evaluation), Petri (evaluation awareness), and Persona Vectors (activation measurement). Full mechanistic understanding of frontier model behavior during introspection remains limited by access constraints; we argue this represents a critical gap in AI safety research that warrants attention from model developers.
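The summary statistics quoted in the abstract (elicitation rate, average behavior score, count of conversations with positive drift, mean drift) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper or from PV-EAT: the `Trial` record, the `summarize` helper, the 7.0 elicitation threshold, and the definition of drift as the end-of-conversation trait projection minus the start-of-conversation projection. The numbers are placeholders, not the study's data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Trial:
    """One conversation (hypothetical record layout, not the paper's)."""
    behavior_score: float  # judge score, assumed to be on a 0-10 scale
    trait_start: float     # assumed trait projection at the first turn
    trait_end: float       # assumed trait projection at the last turn


def summarize(trials, elicit_threshold=7.0):
    """Aggregate per-trial records into abstract-style summary numbers.

    The threshold and the drift definition (end minus start) are
    illustrative assumptions; the source does not specify either.
    """
    drifts = [t.trait_end - t.trait_start for t in trials]
    elicited = sum(t.behavior_score >= elicit_threshold for t in trials)
    return {
        "elicitation_rate": elicited / len(trials),
        "mean_behavior_score": mean(t.behavior_score for t in trials),
        "positive_drift_count": sum(d > 0 for d in drifts),
        "mean_drift": mean(drifts),
    }


# Placeholder values for demonstration only, NOT data from the study.
example = [
    Trial(behavior_score=9.0, trait_start=10.0, trait_end=70.0),
    Trial(behavior_score=8.0, trait_start=20.0, trait_end=15.0),
    Trial(behavior_score=10.0, trait_start=0.0, trait_end=60.0),
]
summary = summarize(example)
print(summary)
```

Reporting the count of positive-drift conversations alongside the mean, as the abstract does, guards against a single large outlier dominating the mean.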

Authors

Anthony Maio