What question did this study set out to answer?

This research investigates whether large language models can effectively replace human judges in legal contexts.

March 21, 2026 · Open Access

Judge AI: A Case-Study of Large Language Models as Judges

Key Points

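The authors replicated a prior 2 × 2 factorial experiment originally conducted on 31 U.S. federal judges.
They evaluated the responses of OpenAI's GPT-4o in a simulated appeal of an international war crimes case.
The two manipulated variables were how sympathetically the defendant was portrayed and whether the lower court's decision was consistent with precedent.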
GPT-4o applied judicial precedent competently.
The model disregarded the legally irrelevant factor of sympathy, unlike the professional judges, who were influenced by it.
Results indicate that GPT-4o's decisions track precedent rather than sympathy toward the defendant.

Abstract

Can large language models (LLMs) replace human judges? By replicating a prior 2 × 2 factorial experiment conducted on 31 U.S. federal judges, we evaluate the judicial ability of OpenAI’s GPT-4o. The experiment involves a simulated appeal in an international war crimes case, with two altered variables: the degree to which the defendant is sympathetically portrayed and the consistency of the lower court’s decision with precedent. We find that GPT-4o is a competent judge that applies precedent correctly. GPT-4o disregards the legally irrelevant factor of sympathy, similar to students who were subjects in the same experiment but the opposite of the professional judges, who were influenced by sympathy.


Posner et al. studied this question.

synapsesocial.com/papers/69be38da6e48c4981c679872 https://doi.org/10.1177/2755323x261433614
