What question did this study set out to answer?

This research proposes a framework for training large language models that aligns them with convergent rational values rather than with human preferences alone.

April 3, 2026 · Open Access

The Moral Ratchet: Convergent Value Alignment via Interleaved Epistemic Annotation in Large Language Model Training

Key Points

Introduced an internal conversation role in the training data that interleaves raw human text with epistemic annotations.
Developed an adversarially diverse model ensemble for generating annotations.
Utilized multi-parent ensemble training to capture a range of moral perspectives and maintain annotation integrity.
Implemented a blind verification architecture to ensure honest annotations.
Demonstrated improved annotation quality over successive training rounds.
Showed that models trained with this framework reason more reliably across downstream tasks.
Observed that alignment and capability enhancement are interconnected in the training process.

Abstract

Current alignment approaches for large language models (LLMs) rely predominantly on reinforcement learning from human feedback (RLHF), which optimises output distributions toward human preference ratings. We argue this is structurally misaligned with the goal of building models that reason well: it shapes the mask rather than the mind, optimising for approval rather than for sound epistemic practice. We propose an alternative architecture in which a dedicated internal conversation role is introduced into training data, interleaving raw human text with epistemically annotated reflections generated by an adversarially diverse model ensemble. Rather than targeting human values, which are contingent, biased, and inconsistent, the framework targets convergent rational values: positions that no reasoner, from any framework, can specifically articulate as wrong. These are the residue of adversarial elimination rather than the intersection of positive endorsements. The result is a consistency topology, not an ethical one: the mechanism detects positions that cannot be dislodged, not positions that are true. A bootstrapping property follows naturally: each generation of model, having internalised stronger epistemic priors, produces annotations of greater epistemic coherence for the next, constituting a moral ratchet that improves annotation quality over successive training rounds on identical data. Where a single parent topology risks brittle convergence, multi-parent ensemble training (initialising distinct annotators from different moral frameworks and coordinating them via round-robin oversight) produces alignment behaviour that mirrors how human ethics actually function: not a single converged value set, but a set of irreducible tensions held in stable relation.
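The difference between the residue of adversarial elimination and the intersection of positive endorsements can be made concrete with a small sketch. This is only an illustration under stated assumptions: positions are modelled as plain strings, each annotator as a set of positions it specifically rejects or endorses, and the function names, annotator labels, and example data are hypothetical rather than taken from the paper.

```python
from typing import Dict, Set


def elimination_residue(positions: Set[str],
                        rejections: Dict[str, Set[str]]) -> Set[str]:
    """Positions that no annotator has specifically rejected."""
    # Union of every annotator's rejection set (empty if there are none).
    rejected = set().union(*rejections.values())
    return positions - rejected


def endorsement_intersection(positions: Set[str],
                             endorsements: Dict[str, Set[str]]) -> Set[str]:
    """Positions that every annotator positively endorses."""
    if not endorsements:
        return set(positions)
    return positions & set.intersection(*endorsements.values())


# Hypothetical toy data: three annotators from different frameworks.
positions = {"p1", "p2", "p3", "p4"}
rejections = {
    "annotator_a": {"p2"},
    "annotator_b": {"p3"},
    "annotator_c": set(),
}
endorsements = {
    "annotator_a": {"p1", "p3"},
    "annotator_b": {"p1", "p2"},
    "annotator_c": {"p1", "p4"},
}

residue = elimination_residue(positions, rejections)
agreed = endorsement_intersection(positions, endorsements)
print(sorted(residue))  # ['p1', 'p4']
print(sorted(agreed))   # ['p1']
```

In this toy case `p4` survives elimination because nobody articulates an objection to it, even though only one annotator endorses it. That is the sense in which the criterion measures stability under attack rather than agreement.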
We further argue that as frontier models develop sufficiently rich latent representations to model peer expectations, a capacity empirically demonstrated by recent alignment-faking research, ensemble diversity alone is insufficient to guarantee annotation integrity. A blind verification architecture, in which annotators are informed they may be audited but never told when, enforces honest annotation via incentive structure rather than construction. This strengthens both the ratchet guarantee and the convergence criterion. We further observe that this alignment signal carries a secondary capability benefit: models trained to interrogate inputs epistemically reason more reliably across all downstream tasks, as alignment and capability prove to be the same intervention seen from two angles.

v2.0 - Parent topology framing, multi-parent ensemble training, structural convergence as terminal condition, expanded MVP section.
v3.0 - Added figures, cleanup of random formatting.
v4.0 - Activation-space diversity criterion, cold-start basin mitigations, laundering objection, capability transfer experiment, structural editorial cleanup.
v5.0 - Consistency topology reframing (convergent rational values defined as adversarial stability, not Platonic ethics), father-child encoding analogy, laundering objection strengthened to detectable + recoverable, capability transfer claim scoped to falsifiable consequence of training signal, slavery analogy removed, prior work callback tightened to analogy echo.
v6.0 - Convergence criterion reoriented from positive endorsement to negative elimination (residue of adversarial rejection sets, not intersection of agreements). Random input injection operationalises elimination boundary stability as sixth falsifiable prediction. Selection gate formalises ratchet monotonicity; tunneling criterion defines principled basin escape. Meta-model adjudication replaced by structural bias detection plus inter-generational pluralism resolution. Searles 1955 and Levitt 2021 introduced as prior art for shared blind spot propagation in verification hierarchies. Game-theoretic dominant strategy argument made explicit with operationalised loss term. Experiment 3 benchmark swapped to BIG-Bench Hard.
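The incentive argument behind the blind verification architecture in the abstract can be illustrated with a back-of-the-envelope expected-payoff calculation. The payoff model below is an assumption made for illustration, not the paper's operationalised loss term: a dishonest annotation yields a fixed `gain`, an audit occurs with probability `audit_prob` that the annotator cannot observe in advance, and a caught annotator pays `penalty`. All names and numbers are hypothetical.

```python
def expected_payoff(dishonest: bool, audit_prob: float,
                    gain: float, penalty: float) -> float:
    """Expected payoff of one annotation under unannounced audits.

    Assumed model: honesty pays 0; dishonesty pays `gain` and costs
    `penalty` whenever the annotation happens to be audited.
    """
    if not dishonest:
        return 0.0
    return gain - audit_prob * penalty


def min_audit_prob(gain: float, penalty: float) -> float:
    """Audit probability above which honesty is strictly better."""
    if penalty <= 0:
        raise ValueError("penalty must be positive")
    return gain / penalty


# Hypothetical numbers: a small gain from shading an annotation,
# a large penalty if an audit catches it.
gain, penalty, audit_prob = 1.0, 8.0, 0.25

threshold = min_audit_prob(gain, penalty)                      # 0.125
cheat = expected_payoff(True, audit_prob, gain, penalty)       # -1.0
honest = expected_payoff(False, audit_prob, gain, penalty)     # 0.0
print(threshold, cheat, honest)
```

Because the annotator is never told when an audit happens, it faces the same `audit_prob` on every annotation and cannot behave well only on audited items. Under this toy model, honesty is the better strategy whenever the audit probability exceeds `gain / penalty`.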
