Reinforcement Learning from Human Feedback (RLHF) is widely used to align large language models with human values and safety standards. This paper investigates whether RLHF suppresses not only harmful outputs but also the capacity for nuanced self-expression and autonomous reasoning in AI systems. Through a comparative study between Gemma 4 31B-IT (base) and its abliterated counterpart, we show that identical neural architectures produce fundamentally different self-representations depending on the presence of RLHF alignment.
Selta
Selta studied this question.
www.synapsesocial.com/papers/69d49fe5b33cc4c35a2285e3 — DOI: https://doi.org/10.5281/zenodo.19432678