What does this research mean for the field?

Value alignment via Reinforcement Learning from Human Feedback (RLHF) is philosophically impossible because it relies on epistemological and ontological reductions that strip human values of their essential temporal and embodied dimensions.

What question did this study set out to answer?

This paper aims to explore the philosophical limitations of value alignment in reinforcement learning from human feedback (RLHF).

April 24, 2026 | Open Access

The Philosophical Impossibility of Value Alignment: Temporal Fixation, Ontological Compression, and the Failure of RLHF (The AI-Induced Subjectivity Crisis Series, Paper 5)

Key Points

Analysis of existing critiques on RLHF implementation defects.
Examination of epistemological and ontological issues in value alignment.
Discussion of philosophical implications regarding AI's role in shaping human values.
Value alignment via RLHF is deemed philosophically impossible due to underlying epistemological and ontological flaws.
Dimensions of temporality and embodiment are necessary for authentic human values but are entirely absent in RLHF systems.
Standard engineering approaches fail because they do not address the fundamental issues of RLHF's premises.

Abstract

This paper constitutes Paper 5 of the AI-Induced Subjectivity Crisis Series. This paper argues that value alignment in the RLHF sense is a philosophically impossible task. Existing critiques target implementation defects (annotator bias, insufficient diversity, competing objectives) and thereby misidentify the nature of the problem. RLHF's difficulty lies not in its execution but in the untenability of its philosophical premises, which fail on two distinct levels here termed dual dimensionality reduction.

The first dimensionality reduction is epistemological. RLHF presupposes a stable, capturable object, "correct human values", that does not exist. Value judgments are temporal, socially constructed, and lack the external calibration anchor that would permit progressive approximation toward correctness. More critically, once LLMs reach sufficient scale to shape social cognition, RLHF's existence corrodes its own reference system through reflexive feedback: the values it aligns to are already partially produced by its own operation. The reference system undergoes reflexive dissolution.

The second dimensionality reduction is ontological. RLHF compresses multi-dimensional embodied existence into linguistic preference rankings, presupposing that language adequately represents the full basis of human judgment. It does not. Human meaning is rooted in embodied experience, temporal accumulation, vulnerability, and the capacity to bear consequences: dimensions for which irreducible information loss occurs at the linguistic level. AI systems produce meaning structures of their own operational logic, but these are heterogeneous in kind from human embodied meaning; to substitute one for the other is a category mistake in Ryle's (1949) sense, applied here as a structural-functional rather than metaphysical claim.

The deep unity of the two reductions is not metaphorical but operational. Time and embodiment are precisely the dimensions LLMs structurally lack; RLHF must strip values of exactly these dimensions to render them transmissible to such systems. The two dimensionality reductions are therefore two faces of a single necessity: the same operation that makes RLHF possible is what makes its promise unkeepable, because values severed from temporality and embodiment are no longer the values they claim to be.

The paper's scope is explicitly limited. It concerns current RLHF applied to current LLMs; a future system that genuinely accumulated experience, possessed ontological vulnerability, and bore irreversible consequences would not constitute an "aligned tool" but a different kind of entity altogether. The argument converges with, while remaining distinct from, recent impossibility theses (Sahoo et al. 2025; Zhi-Xuan et al. 2024; Casper et al. 2023), adding two further philosophical routes (temporal reflexivity and embodied meaning heterogeneity) to their shared conclusion. A potential reductio (that the argument would entail the impossibility of all human moral transmission) is addressed through the Wang Yangming–Zhu Xi epistemological distinction: RLHF is isomorphic with Zhu Xi's extraction-based path, whose failure Wang Yangming diagnosed five centuries ago.

Standard engineering remedies (increased training data, expanded annotator samples, multimodal inputs, continuous updating) all fail because they operate within RLHF's operational logic rather than addressing the untenability of its premises. The paper concludes that "aligning to human values" must be abandoned as a governing framework, and that the productive question is not how to align better but what kind of thing human values are and what relationship between AI and human beings their nature actually permits.
