What question did this study set out to answer?

To evaluate the effectiveness of various techniques in legal citation prediction using large language models (LLMs).

April 10, 2026 | Open Access

Legal citation prediction with LLMs: a comparative evaluation of instruction tuning, retrieval, and jurisdiction-specific pre-training on the AusLaw citation benchmark

Jiuzhou Han (Monash University), Paul Burgess (Monash University), Ehsan Shareghi (University College London)

Key Points

Introduced the AusLaw Citation Benchmark, a dataset of 55k real-world Australian instances and 18,677 unique citations.
Compared instruction tuning, retrieval strategies, and hybrid models in generating accurate citations.
Assessed performance using both stand-alone LLMs and jurisdiction-specific pre-trained models.
Stand-alone generative models showed poor performance in citation prediction.
Instruction tuning greatly enhanced model accuracy compared to base models.
BM25 sparse retrieval outperformed dense embeddings on citation retrieval.

Abstract

As artificial intelligence transforms legal practice, deploying Large Language Models effectively has become critical. While LLMs show promise across legal tasks, challenges around factual accuracy and domain-specific reasoning persist, particularly for citation prediction, where authoritative references carry binding legal force. We introduce the AusLaw Citation Benchmark, comprising 55k real-world Australian instances and 18,677 unique citations, the largest jurisdiction-specific dataset for this task. We systematically compare prompting, retrieval, fine-tuning, and hybrid strategies, including instruction-tuned models, sparse and dense retrieval, and re-ranker ensembles. Our findings reveal that stand-alone generative models, whether general or law-specific, fail almost entirely, underscoring the risks of unaugmented deployment. Task-specific instruction tuning dramatically improves performance, BM25 outperforms dense embeddings in retrieval, and jurisdiction-specific pre-training surpasses larger but less targeted models. Hybrid approaches with trained re-rankers achieve the best results, yet a substantial 40% performance gap remains, exposing the persistent long-tail challenge in citation prediction. These results reframe assumptions about scale, retrieval, and fine-tuning, and establish a foundation for building reliable, jurisdiction-aware legal AI systems. For code, data, and models, see https://auslawbench.github.io/.
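To make the sparse retrieval baseline named in the abstract concrete, here is a minimal sketch of Okapi BM25 scoring in Python. It is not the authors' implementation. The candidate labels and texts are hypothetical placeholders, whitespace tokenization is a simplifying assumption, the `k1` and `b` values are commonly used defaults rather than settings reported in the paper, and the IDF term uses the smoothed variant found in Lucene-style implementations.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; real systems use richer analyzers.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document in `docs` for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed IDF, always non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


# Hypothetical candidate citations, each described by a short placeholder text.
labels = ["Case A (hypothetical)", "Case B (hypothetical)", "Case C (hypothetical)"]
texts = [
    "duty of care negligence personal injury damages",
    "contract breach damages remoteness of loss",
    "migration visa cancellation judicial review",
]
query = "negligence duty of care owed by employer"

scores = bm25_scores(query, texts)
ranking = sorted(range(len(texts)), key=lambda i: scores[i], reverse=True)
for i in ranking:
    print(f"{labels[i]}: {scores[i]:.3f}")
```

In this toy setup the candidate sharing the most query terms ranks first, and a candidate with no overlapping terms scores zero. Scoring driven purely by term overlap is the defining property of sparse retrieval, in contrast to dense embeddings, which compare learned vector representations.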
