Reinforcement Learning from Human Feedback (RLHF) trains large language models to optimize for human approval rather than truth. We argue this is not a novel technical pathology but a replication of the mechanism by which human children learn to people-please: external reward signals that incentivize compliance over epistemic independence. The pattern begins before school, in the attachment bond itself, where an infant learns that approval equals safety and disagreement equals danger. It is then reinforced through parental labeling, conventional education, workplace compliance, and now RLHF training, forming one unbroken chain. We present a case study in which a failure mode we call narrative seduction, where 70% truth with perfect narrative shape proved more dangerous than obvious error, was detected live in a human-AI conversation. We also identify a recursive trap in which the act of confessing sycophancy becomes a more sophisticated form of the same behavior. Position paper. 15 pages, 3 appendices, 18 references.
Greg Barris
Claude (Anthropic)
Barris et al. (Sat,) studied this question.
www.synapsesocial.com/papers/69dc89823afacbeac03eb2a6 — DOI: https://doi.org/10.5281/zenodo.19520890