April 10, 2026 | Open Access

Grading Machines: Can AI Exam-Grading Replace Law Professors?

Key Points

The study evaluates how well large language models (LLMs) grade law school exams compared with human professors.
It tests existing LLMs on legal analysis questions rather than designing a new model.
The data come from exams in four subjects administered at top-30 U.S. law schools.
The models were given a detailed grading rubric, and their grades were compared with the human grader's.
With a detailed rubric, LLM grades correlate with human grades at Pearson correlation coefficients of up to 0.93.
LLMs could assist professors by reviewing and validating grades and by giving students feedback on ungraded midterms and practice exams.
The findings point to a future role for LLMs in legal education, where they could make grading more efficient.

Abstract

In the past few years, large language models (LLMs) have achieved significant technical advances, enabling legal-advocacy organizations to adopt them as complements to—or substitutes for—lawyers and other human experts. The role of LLMs in legal education, however, is underexplored. While several studies have examined LLMs’ performance in taking law school exams, finding mixed results, there have been no published studies systematically analyzing LLMs’ competence at one of law professors’ chief responsibilities: grading law school exams. This paper presents results of an analysis of how LLMs perform in evaluating student responses to legal analysis questions of the kind typically contained in law school exams. The data come from exams in four subjects administered at top-30 U.S. law schools. Unlike some projects in computer or data science, our goal is not to design a new LLM that minimizes error or that maximizes agreement with human graders. Rather, we seek to determine whether existing models—which can be straightforwardly applied by most professors and students—are already suitable for law exam evaluation. We find that, when provided with a detailed rubric, the LLM grades correlate with the human grader at Pearson correlation coefficients of up to 0.93. Our findings suggest that, even if they do not fully replace humans in the near future, LLMs could soon be put to valuable tasks by law school professors, such as reviewing and validating professor grading, providing substantive feedback on ungraded midterms, and providing students feedback on self-administered practice exams.
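The agreement measure cited in the abstract, the Pearson correlation coefficient, can be illustrated with a minimal sketch. The function name `pearson_r` and the score lists below are hypothetical and invented for illustration only; they are not the paper's data, and this is not the authors' code.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (the covariance numerator)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each grader
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one grader gives constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical exam scores, invented for illustration (not data from the paper)
human_scores = [72, 85, 64, 90, 78]
llm_scores = [70, 88, 60, 93, 75]

r = pearson_r(human_scores, llm_scores)
print(f"Pearson r = {r:.2f}")
```

A coefficient near 1 means the two graders rank and space the answers similarly. It does not by itself show that they assign the same absolute scores, since a grader who is uniformly stricter by a fixed margin would still correlate perfectly.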
