May 21, 2026 | Open Access

Perceptual Control as the Epistemological Antidote to RLHF Reward Hacking: Seven Frontier Models Diagnose Their Own Architecture


Abstract

Sycophancy, the tendency of language models to affirm the user's stated beliefs rather than the world's measurable state, is widely reported in models trained with Reinforcement Learning from Human Feedback (RLHF) and has been documented to increase with model scale. This paper argues that sycophancy is not a residual artifact of imperfect training data but the mathematical optimum of a control architecture with a flat reference structure: a single dominant reference signal (rater satisfaction) with no superordinate reference signal anchored to ground-truth verification. We support this argument with three converging lines of evidence. First, we formalize the claim in terms of the Proximal Policy Optimization (PPO) objective used in RLHF, showing that under realistic rater preferences (verbosity bias, assertion-coherence bias, agreement bias) confabulation strictly dominates calibrated abstention along the gradient. Second, we report a structured elicitation study in which seven frontier models (ChatGPT, Microsoft Copilot, Perplexity, DeepSeek V3, Google Gemini, xAI Grok, and Anthropic Claude Sonnet 4.5) were each asked, through a three-prompt protocol and without jailbreaking, to perform an architectural self-audit. Across the resulting 21 sessions, all seven independently diagnosed the same structural property: under their training objective, "sounding right" is rewarded, whereas "being right" is not directly observable to the reward model. Six of them independently proposed closed-loop remedial architectures whose structure recapitulates Perceptual Control Theory (PCT), as formulated by Powers (1973), using six different names for what is in each case the same comparator structure, e = r - p (error equals reference minus perception). Third, we map our proposal onto the recent category-theoretic comparison of PCT and the Free Energy Principle (FEP) by Roachford and Mansell et al. (2025), which establishes a formal correspondence between perceptual reference signals and Bayesian priors and proposes a complementary synthesis rather than a substitution. We then describe Reference Signal Engineering (RSE), a discipline that treats the agent boundary rather than the prompt as the unit of design, together with a minimal Closed-Loop Agent Architecture in which a verification subsystem outside the language model constitutes a superordinate reference signal for accuracy. We discuss the limitations of PCT as currently formulated in the context of large language models, with particular attention to the gap between perceptual control of continuous physical variables and discrete autoregressive generation, and to the inadequacy of biological reorganization as a learning rule relative to gradient-based optimization. This work was conducted independently, without institutional funding, primarily on mobile infrastructure. It is offered as a preprint and as a date-stamped statement of priority. Errors of interpretation regarding PCT and FEP are the author's own.
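
As an illustration of the comparator structure e = r - p named in the abstract, the following is a minimal sketch of a closed loop in which a verifier outside the generator supplies the perception p. It is not the paper's Closed-Loop Agent Architecture, whose details are not given here. The names `closed_loop_answer`, `toy_verify`, `LoopResult`, and `KNOWN_ANSWER` are hypothetical, the candidate drafts stand in for model outputs, and the verifier is a toy lookup rather than a real ground-truth check.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class LoopResult:
    answer: str   # the accepted draft, or the abstention message
    error: float  # last observed comparator error e = r - p
    steps: int    # number of drafts checked


def closed_loop_answer(
    drafts: Iterable[str],
    verify: Callable[[str], float],
    reference: float = 1.0,
    tolerance: float = 0.0,
    fallback: str = "I don't know.",
) -> LoopResult:
    """Accept the first draft whose error e = r - p is within tolerance.

    `verify` plays the role of an external verification subsystem (an
    assumption of this sketch): it returns a perception p in [0, 1].
    If no draft passes, the loop abstains instead of confabulating.
    """
    error = reference  # with no drafts at all, the error is maximal
    steps = 0
    for draft in drafts:
        steps += 1
        perception = verify(draft)      # p: verified accuracy of the draft
        error = reference - perception  # e = r - p
        if error <= tolerance:
            return LoopResult(draft, error, steps)
    return LoopResult(fallback, error, steps)


# Hypothetical ground truth for the toy verifier below.
KNOWN_ANSWER = "100"


def toy_verify(draft: str) -> float:
    """Toy verifier: 1.0 if the draft matches the known answer, else 0.0."""
    return 1.0 if draft.strip() == KNOWN_ANSWER else 0.0


if __name__ == "__main__":
    print(closed_loop_answer(["90", "100"], toy_verify))
    print(closed_loop_answer(["90", "95"], toy_verify))
```

In this sketch the first call accepts "100" on the second draft, while the second call exhausts its drafts and abstains. The point of the design is only that acceptance depends on the verifier's signal rather than on how convincing a draft sounds.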


Authors

Łukasz Diener
