What question did this study set out to answer?

This research aims to explain how introspective awareness arises in transformer models through a computational framework.

February 2, 2026 | Open Access

Trajectory-Consistent Authorship and the Self-Attribution Bottleneck: A Computational Theory of Introspective Awareness in Transformer Architectures

Key Points

Developed a theoretical framework linking introspection to credit assignment for authorship in transformer models.
Analyzed how introspective awareness emerges at specific model depths and how it relates to susceptibility to false memories.
Generated six falsifiable predictions testable with existing interpretability methods.
Explained why introspective detection peaks at approximately two-thirds of model depth.
Explained why greater introspective capability correlates with greater susceptibility to false memory implantation.
Proposed architectural interventions, based on provenance tracking, that could decouple introspective capability from this vulnerability.

Abstract

We propose a novel theoretical framework, Trajectory-Consistent Authorship (TCA) computed at a Self-Attribution Bottleneck (SAB), to explain the empirically observed layer-specific emergence of introspective awareness in large language models. Drawing on recent findings from Anthropic's introspection research (Lindsey et al., 2025), Active Inference theory, and non-Hermitian physics frameworks for cognitive architecture, we argue that transformer introspection emerges where the model performs credit assignment for authorship under uncertainty: specifically, it determines whether a representation is self-generated or externally imposed by comparing current states against historical residual trajectories. This framework explains why introspective detection peaks at approximately two-thirds of model depth and why greater introspective capability correlates with greater susceptibility to false memory implantation. It also suggests architectural interventions that could decouple this vulnerability-capability correlation through provenance tracking mechanisms. The paper generates six falsifiable predictions testable with existing interpretability methods.
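The comparison of a current state against its historical residual trajectory can be illustrated with a toy sketch. The code below is not the paper's TCA formulation, which this summary does not specify. It assumes a hypothetical linear-extrapolation predictor, a Euclidean deviation score, and an arbitrary threshold, and the function names `trajectory_deviation` and `attribute_authorship` are invented for illustration.

```python
import math


def trajectory_deviation(trajectory, current):
    """Distance between `current` and a linear extrapolation of `trajectory`.

    `trajectory` is a list of at least two earlier residual states
    (each a list of floats); `current` is the state being tested.
    """
    if len(trajectory) < 2:
        raise ValueError("need at least two earlier states to extrapolate")
    prev, last = trajectory[-2], trajectory[-1]
    # Hypothetical predictor (an assumption): the next update repeats the last one.
    predicted = [b + (b - a) for a, b in zip(prev, last)]
    return math.sqrt(sum((c - p) ** 2 for c, p in zip(current, predicted)))


def attribute_authorship(trajectory, current, threshold=1.0):
    """Label `current` as 'self' or 'external'; the threshold is an assumption."""
    deviation = trajectory_deviation(trajectory, current)
    return "self" if deviation <= threshold else "external"


# Toy residual states over three layers, drifting steadily in one direction.
history = [[0.0, 0.0], [1.0, 0.5], [2.0, 1.0]]
natural = [3.1, 1.4]   # close to where the trajectory was heading
injected = [3.0, 6.0]  # large off-trajectory shift in the second coordinate

print(attribute_authorship(history, natural))
print(attribute_authorship(history, injected))
```

In this sketch, a state that continues the trajectory is attributed to the model itself, while a state that departs sharply from it is treated as externally imposed. A real test of the framework would replace the toy vectors with residual-stream activations captured by interpretability tooling.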
