This study investigates the legal reasoning abilities of Large Language Models (LLMs) in Taiwan’s Status Law (family and inheritance law) and evaluates the effects of Chain-of-Thought (CoT) prompting on answer quality. Six essay questions from past judicial and graduate law exams were decomposed into 68 sub-questions targeting issue spotting, statutory application, legal reasoning, and property calculation. Four LLMs (ChatGPT-4o, Gemini, Copilot, and Grok3) were evaluated using a two-stage framework: decomposed sub-question accuracy (Stage 1) and full-length essay response performance with and without CoT prompting (Stage 2), with human scoring conducted by a law professor and a student. Results show that CoT prompting consistently improves legal reasoning quality across models, notably enhancing issue coverage, statutory citation accuracy, and reasoning structure. Gemini achieved the most significant accuracy gains (from 83.2% to 94.5%, p < 0.05) and was selected for detailed qualitative analysis. Beyond model-specific findings, this study contributes to retrieval evaluation research by addressing statistical consistency challenges in human scoring, proposing a diagnostic evaluation method adaptable for multilingual and multimedia legal corpora, and suggesting extensions for evaluating enterprise-level legal information systems. These findings underscore the value of structured prompting strategies in supporting more interpretable, transferable, and scalable legal AI evaluation frameworks.
Yu et al. studied this question.