As artificial intelligence transforms legal practice, deploying Large Language Models (LLMs) effectively has become critical. While LLMs show promise across legal tasks, challenges around factual accuracy and domain-specific reasoning persist, particularly for citation prediction, where authoritative references carry binding legal force. We introduce the AusLaw Citation Benchmark, comprising 55k real-world Australian instances and 18,677 unique citations, the largest jurisdiction-specific dataset for this task. We systematically compare prompting, retrieval, fine-tuning, and hybrid strategies, including instruction-tuned models, sparse and dense retrieval, and re-ranker ensembles. Our findings reveal that stand-alone generative models, whether general or law-specific, fail almost entirely, underscoring the risks of unaugmented deployment. Task-specific instruction tuning dramatically improves performance, BM25 outperforms dense embeddings in retrieval, and jurisdiction-specific pre-training surpasses larger but less targeted models. Hybrid approaches with trained re-rankers achieve the best results, yet a substantial 40% performance gap remains, exposing the persistent long-tail challenge in citation prediction. These results reframe assumptions about scale, retrieval, and fine-tuning, and establish a foundation for building reliable, jurisdiction-aware legal AI systems. For code, data, and models, see https://auslawbench.github.io/.
Jiuzhou Han
Paul Burgess
Ehsan Shareghi
Artificial Intelligence and Law
University College London
Monash University
Han et al. studied this question.
www.synapsesocial.com/papers/69d894ce6c1944d70ce05b8f — DOI: https://doi.org/10.1007/s10506-026-09506-9