What type of study is this?

This is a quantitative study.

September 24, 2025 · Open Access

Steering Towards Fairness: Mitigating Political Bias in LLMs

Key Points

Decoder LLMs systematically encode representational bias across their layers.
An activation extraction pipeline supports layer-wise analysis across multiple ideological axes and reveals meaningful disparities linked to political framing.
Contrastive pairs grounded in the Political Compass Test (PCT) are used to extract and compare hidden layer activations from models like Mistral and DeepSeek.
The framework points toward a principled debiasing approach based on steering vectors, one that goes beyond surface-level output interventions.

Abstract

Recent advancements in large language models (LLMs) have enabled their widespread use across diverse real-world applications. However, concerns remain about their tendency to encode and reproduce ideological biases along political and economic dimensions. In this paper, we employ a framework for probing and mitigating such biases in decoder-based LLMs through analysis of internal model representations. Grounded in the Political Compass Test (PCT), this method uses contrastive pairs to extract and compare hidden layer activations from models like Mistral and DeepSeek. We introduce a comprehensive activation extraction pipeline capable of layer-wise analysis across multiple ideological axes, revealing meaningful disparities linked to political framing. Our results show that decoder LLMs systematically encode representational bias across layers, which can be leveraged for effective steering vector-based mitigation. This work provides new insights into how political bias is encoded in LLMs and offers a principled approach to debiasing beyond surface-level output interventions.
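The abstract does not spell out how the steering vectors are computed, so the following is only a minimal sketch of one common construction, difference of means over contrastive activations, and not necessarily the authors' method. The toy activation values, the layer indices, the helper names (`mean_vector`, `steering_vector`, `steer`), and the scaling factor `alpha` are all assumptions made for illustration. In practice the activations would be read from a model's hidden layers rather than written by hand.

```python
from math import sqrt

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def steering_vector(acts_a, acts_b):
    """Difference of mean activations between the two sides of the contrastive pairs."""
    mean_a = mean_vector(acts_a)
    mean_b = mean_vector(acts_b)
    return [a - b for a, b in zip(mean_a, mean_b)]

def norm(vector):
    """Euclidean length, used here as a crude per-layer disparity score."""
    return sqrt(sum(x * x for x in vector))

def steer(hidden, direction, alpha):
    """Shift a hidden state along the steering direction by a factor alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Hypothetical toy data: layer index -> (activations for one framing,
# activations for the opposite framing). Real values would come from the model.
toy_activations = {
    0: ([[1.0, 0.0], [1.0, 2.0]], [[1.0, 0.0], [1.0, 2.0]]),
    1: ([[3.0, 1.0], [5.0, 1.0]], [[0.0, 1.0], [2.0, 1.0]]),
}

# One steering vector and one disparity score per layer.
layer_vectors = {
    layer: steering_vector(acts_a, acts_b)
    for layer, (acts_a, acts_b) in toy_activations.items()
}
layer_gaps = {layer: norm(vec) for layer, vec in layer_vectors.items()}

# Move a layer-1 hidden state halfway back toward the opposite framing.
steered = steer([4.0, 1.0], layer_vectors[1], alpha=-0.5)
```

In this toy setup layer 0 shows no gap between the two framings while layer 1 does, which mirrors the kind of layer-wise comparison the abstract describes. The sign and size of `alpha` control the direction and strength of the shift and would need tuning on a real model.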
