What question did this study set out to answer?

The study set out to evaluate how well LLMs reason about Taiwan’s family and inheritance law, and how Chain-of-Thought (CoT) prompting affects that reasoning.

April 1, 2026 · Open Access

Structured Evaluation of Legal Reasoning in LLMs: Chain-of-Thought Prompting and Human Scoring for Retrieval Robustness

Key Points

Evaluated the legal reasoning capabilities of LLMs in Taiwan’s family and inheritance law, along with the effects of Chain-of-Thought (CoT) prompting.
Analyzed six essay questions from judicial and graduate law exams, decomposed into 68 sub-questions.
Evaluated four LLMs using a two-stage framework: sub-question accuracy and full-length essay performance.
Collected human scores from a law professor and a student.
CoT prompting improved legal reasoning quality across all models.
Gemini showed the most significant accuracy improvement, from 83.2% to 94.5% (p < 0.05).
Proposed a diagnostic evaluation method adaptable to multilingual and multimedia legal corpora.

Abstract

This study investigates the legal reasoning abilities of Large Language Models (LLMs) in Taiwan’s Status Law (family and inheritance law) and evaluates the effects of Chain-of-Thought (CoT) prompting on answer quality. Six essay questions from past judicial and graduate law exams were decomposed into 68 sub-questions targeting issue spotting, statutory application, legal reasoning, and property calculation. Four LLMs (ChatGPT-4o, Gemini, Copilot, and Grok3) were evaluated using a two-stage framework: decomposed sub-question accuracy (Stage 1) and full-length essay response performance with and without CoT prompting (Stage 2), with human scoring conducted by a law professor and a student. Results show that CoT prompting consistently improves legal reasoning quality across models, notably enhancing issue coverage, statutory citation accuracy, and reasoning structure. Gemini achieved the most significant accuracy gains (from 83.2% to 94.5%, p < 0.05) and was selected for detailed qualitative analysis. Beyond model-specific findings, this study contributes to retrieval evaluation research by addressing statistical consistency challenges in human scoring, proposing a diagnostic evaluation method adaptable for multilingual and multimedia legal corpora, and suggesting extensions for evaluating enterprise-level legal information systems. These findings underscore the value of structured prompting strategies in supporting more interpretable, transferable, and scalable legal AI evaluation frameworks.
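
The two-stage framework described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the data layout, and the rater-gap measure are assumptions made here for illustration, and the abstract does not say which statistical test produced the reported p-value, so none is reproduced. Stage 1 is modelled as per-category accuracy over sub-questions, and Stage 2 as the mean paired difference between essay scores with and without CoT prompting.

```python
from collections import defaultdict
from statistics import mean


def stage1_accuracy(results):
    """Per-category accuracy from (category, is_correct) pairs.

    The (category, bool) layout is a hypothetical encoding of graded
    sub-questions, not a format taken from the paper.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in results:
        totals[category] += 1
        correct[category] += int(is_correct)
    return {category: correct[category] / totals[category] for category in totals}


def mean_cot_gain(scores_plain, scores_cot):
    """Mean paired score difference (CoT minus plain) over the same essay questions."""
    if len(scores_plain) != len(scores_cot):
        raise ValueError("score lists must cover the same questions")
    return mean(cot - plain for plain, cot in zip(scores_plain, scores_cot))


def mean_rater_gap(rater_a, rater_b):
    """Mean absolute difference between two raters' scores.

    A deliberately crude consistency check, assumed here for illustration;
    the paper may use a different agreement statistic.
    """
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must score the same responses")
    return mean(abs(a - b) for a, b in zip(rater_a, rater_b))
```

With graded sub-questions tagged by the four categories named in the abstract (issue spotting, statutory application, legal reasoning, property calculation), `stage1_accuracy` gives a per-category breakdown. `mean_cot_gain` and `mean_rater_gap` take parallel score lists for the same essay responses.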
