Can large language models (LLMs) replace human judges? By replicating a prior 2 × 2 factorial experiment conducted on 31 U.S. federal judges, we evaluate the judicial ability of OpenAI’s GPT-4o. The experiment involves a simulated appeal in an international war crimes case, with two manipulated variables: the degree to which the defendant is sympathetically portrayed and the consistency of the lower court’s decision with precedent. We find that GPT-4o is a competent judge that applies precedent correctly. GPT-4o disregards the legally irrelevant factor of sympathy, similar to the students who were subjects in the same experiment but the opposite of the professional judges, who were influenced by sympathy.
Posner et al. studied this question.