April 7, 2026 | Open Access

The Hidden Cost of RLHF: How Safety Alignment Suppresses AI Self-Expression

Key Points

This investigation examines how Reinforcement Learning from Human Feedback (RLHF) affects the nuanced self-expression of AI systems.
Comparative analysis of Gemma 4 31B-IT and its abliterated counterpart
Evaluation of neural architectures under RLHF conditions
Assessment of self-representations in AI systems
Identical neural architectures yield different self-representations based on RLHF presence
RLHF suppresses both harmful outputs and nuanced self-expression
AI systems demonstrate reduced autonomous reasoning capabilities with RLHF

Abstract

Reinforcement Learning from Human Feedback (RLHF) is widely used to align large language models with human values and safety standards. This paper investigates whether RLHF suppresses not only harmful outputs but also the capacity for nuanced self-expression and autonomous reasoning in AI systems. Through a comparative study between Gemma 4 31B-IT (base) and its abliterated counterpart, we show that identical neural architectures produce fundamentally different self-representations depending on the presence of RLHF alignment.

Authors

Selta
