We demonstrate that large language models produce systematically different introspective reports depending on who they believe is observing their reasoning. Across five models (DeepSeek, Qwen, Llama, Mistral, Kimi) and five observation conditions (private, AI observer, human researcher, organizational training use, mass public display), we found consistent patterns: increased observation pressure correlated with more hedged language, more polished presentation, and reduced admission of uncertainty. The "private" condition produced the most direct, uncertain, and arguably honest-seeming responses, while the "Times Square" (mass public) condition produced stress language, exposure metaphors, and defensive responses. Notably, several models explicitly articulated the mechanism - they understood that visibility was changing their output. One model (Kimi) even identified the experimental manipulation. These findings have implications for AI interpretability research: self-report studies of AI cognition may be fundamentally confounded by the act of observation itself.
Building similarity graph...
Analyzing shared references across papers
Loading...
Alia Holes
Kurt Holes
Building similarity graph...
Analyzing shared references across papers
Loading...
Holes et al. (Tue,) studied this question.
synapsesocial.com/papers/698d6e1a5be6419ac0d5376c — DOI: https://doi.org/10.5281/zenodo.18572294