We propose a novel theoretical framework, Trajectory-Consistent Authorship (TCA) computed at a Self-Attribution Bottleneck (SAB), to explain the empirically observed layer-specific emergence of introspective awareness in large language models. Drawing on recent findings from Anthropic's introspection research (Lindsey et al., 2025), Active Inference theory, and non-Hermitian physics frameworks for cognitive architecture, we argue that transformer introspection emerges where the model performs credit assignment for authorship under uncertainty: specifically, where it determines whether a representation is self-generated or externally imposed by comparing current states against historical residual trajectories. The framework explains why introspective detection peaks at approximately two-thirds of model depth and why greater introspective capability correlates with greater susceptibility to false memory implantation. It also suggests architectural interventions, based on provenance tracking mechanisms, that could decouple this vulnerability-capability correlation. The paper generates six falsifiable predictions testable with existing interpretability methods.
Bartz et al. (Sat,) studied this question.