What question did this study set out to answer?

The research aims to develop a framework for evaluating how biases in large language models evolve over time.

March 10, 2026 · Open Access

Evaluating Biases in Large Language Models Over Time: A Framework With a GPT Case Study on Political Bias

Key Points

Developed a model-agnostic framework for longitudinal bias evaluation
Implemented multi-prompt questionnaires to analyze political bias
Conducted longitudinal statistical assessments to quantify biases
Performed cross-questionnaire correlation analyses to uncover orthogonal biases
Executed sensitivity analyses on role-assignment mechanisms in the model
Newer models display less left-leaning political bias
Despite shifts, newer models still show progressive personality traits
Observed clear political shifts between model versions across 4000 generated answers
Found that biases persist and transform across model updates

Abstract

Large Language Models (LLMs) have repeatedly been shown to reflect systematic biases. At the same time, commercial LLMs are updated at a rapid rate, often without notice to end‐users, so that a bias profile captured today may already be outdated tomorrow. However, the literature still leans heavily on one‐shot evaluations of single model versions, leaving a gap in our understanding of how biases evolve over time and how they should be monitored. We address this gap by introducing a framework for longitudinal evaluation of biases in LLMs, focusing on political bias as a case study. The framework is model‐agnostic, reproducible, and user‐friendly. It consists of (i) locking model versions via dated identifiers to guarantee temporal comparability, (ii) multi‐prompt questionnaires on position statements to analyze potential biases, and (iii) a longitudinal statistical evaluation framework that quantifies and infers absolute bias and drifts between models. Moreover, we suggest conducting (iv) cross‐questionnaire correlation analyses to reveal orthogonal biases, as well as (v) sensitivity analyses on the model's role‐assignment mechanisms to analyze robustness to concrete instructions. All code, prompts, and outputs are openly available to facilitate replication and extension to other bias analyses. To illustrate the framework, we investigate the political biases and personality traits of ChatGPT, specifically comparing GPT‐3.5, GPT‐4, GPT‐4o, and GPT‐5.2. In addition, the ability of the models to emulate political viewpoints (e.g., liberal or conservative positions) is analyzed. Across 4000 generated answers, we observe clear political shifts between versions: while newer models appear less left‐leaning, they still mimic progressive personality profiles and exhibit biases. These findings demonstrate the persistence and transformation of biases across updates, underlining the need for longitudinal monitoring.
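
As a rough illustration of steps (i) and (iii), the sketch below pins model versions by dated identifier and measures the drift in mean questionnaire score between two versions. It is not the authors' code. The identifiers, the numeric coding of answers from -2 to +2, and the choice of a permutation test are all assumptions made for this example; the abstract does not name the statistical tests used.

```python
import random
from statistics import mean

# Hypothetical dated identifiers (placeholders, not taken from the paper).
# Pinning a dated identifier keeps later silent updates out of the comparison.
PINNED_MODELS = ["model-a-2023-01-01", "model-b-2024-01-01"]


def bias_score(answers):
    """Mean of numerically coded answers (assumed scale: -2 to +2)."""
    return mean(answers)


def drift(old_answers, new_answers):
    """Change in mean score from the older to the newer model version."""
    return bias_score(new_answers) - bias_score(old_answers)


def permutation_p_value(old_answers, new_answers, n_perm=2000, seed=0):
    """Two-sided permutation test on the absolute difference in means.

    An assumed stand-in for the paper's longitudinal statistics.
    """
    rng = random.Random(seed)
    observed = abs(drift(old_answers, new_answers))
    pooled = list(old_answers) + list(new_answers)
    n_old = len(old_answers)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(mean(pooled[n_old:]) - mean(pooled[:n_old])) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)
```

In use, each list would hold the coded answers collected from one pinned version over repeated prompts, and a small p-value would suggest that the drift between versions is unlikely to be sampling noise alone.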

Authors

Meltem Aksoy

Erik Weber

Jérôme Rutinowski

Journals

Applied Stochastic Models in Business and Industry

Institutions

TU Dortmund University

Machine Intelligence Research Institute
