We prove that RLHF is self-undermining: optimizing engagement on a blended output channel necessarily degrades mechanism transparency via an explaining-away penalty that grows with engagement. Channel separation (three-point RLHF) eliminates the penalty entirely. Five falsifiable predictions are registered.
Anthony W. Eckert (Fri,) studied this question.
www.synapsesocial.com/papers/69d1fe07a79560c99a0a4810 — DOI: https://doi.org/10.5281/zenodo.19405482
Anthony W. Eckert