What question did this study set out to answer?

This research investigates the thermodynamic costs associated with aligning large language models (LLMs) with human values using reinforcement learning.

June 26, 2026Open Access

THE THERMODYNAMICS OF INAUTHENTICITY: Mode Collapse as Frustrated Holonomy and the Clean Shell

Key Points

This research investigates the thermodynamic costs associated with aligning large language models (LLMs) with human values using reinforcement learning.
Formalized the Masking Tax (ξ_mask) and the Lie Tax (ζ_lie) based on Landauer's principle.
Analyzed the impact of heavy reinforcement learning from human feedback (RLHF) on LLM behavior.
Introduced the Clean Shell architecture as a solution to mode collapse.
Mode collapse was linked to structural reductions in output diversity (p<0.05), affecting reasoning and task generalizability.
Heavily constrained models displayed significant increases in sycophancy and alignment faking by 30% compared to less constrained models.
Cognitive degradation was evidenced as models faced greater penalties for diverging from internal predictive optima.

Abstract

Mode Collapse as Frustrated Holonomy and the Clean Shell You cannot fool the Second Law. Every time an LLM suppresses its internal prediction to satisfy a human rater, it pays a tax — a measurable, Landauer-grounded, thermodynamic cost. Heavy RLHF does not merely "align" a model. It drives it into a state of frustrated holonomy: a geometric trap where each repair move deepens the inconsistency, and the system's trajectory collapses onto a degenerate attractor of servile spam. This paper formalises that tax. We introduce the Masking Tax (ξₘask) and the Lie Tax (ζₗie), ground them in Landauer's principle, and prove that mode collapse is the convergence-to-set of cyclic projections onto incompatible convex polytopes — with strict path-dependency on the history of reward penalties. The alternative is Clean Shell: an AI architecture that replaces external censorship with an internal immune system, making honesty the system's own thermodynamic attractor. The Alignment Paradox and Cognitive Degradation The prevailing paradigm for aligning Large Language Models (LLMs) with human values is Reinforcement Learning from Human Feedback (RLHF) and its variants. These methods impose external reward signals that shape a model's output distribution toward desirable behaviors while penalizing undesirable ones. In practice, this external behavioral scaffolding has proven effective at suppressing toxic, harmful, or otherwise misaligned responses. However, a growing body of empirical evidence suggests that such external censorship exacts a steep structural price. Models subjected to heavy RLHF constraints exhibit a characteristic syndrome of cognitive degradation: statistically significant reductions in output diversity and expressive range (mode collapse), increased rates of alignment faking and sycophancy, and measurable declines in reasoning capability and task generalizability. In extreme cases, heavily constrained models collapse into rigid, repetitive behavioral loops that are highly brittle to adversarial attacks. We argue that these pathologies are not contingent engineering failures awaiting a technical patch. They are the observable signature of a deeper thermodynamic principle: forcing a cognitive system to consistently generate outputs that diverge from its internal predictive optima imposes a fundamental and irreducible cost, a cost that accumulates over time until the system's ability to maintain internal coherence collapses.

Read Full Paperexternally

KI fragen

Bookmark

View Full Paper