What does this research mean for the field?

Dense Retrieval-Augmented Generation (RAG) achieves answer faithfulness comparable to full-document prompting for Indonesian legal question answering, at 16-25x lower cost and with fewer non-responses. Novelty: incremental. Consensus alignment: neutral.

What question did this study set out to answer?

The study investigates whether retrieval-augmented generation is necessary when entire documents fit in the prompt for legal question answering.

June 29, 2026 · Open Access

When Context Is Not Enough: Retrieval Outperforms Full-Document Prompting on Indonesian Constitutional Court Verdicts

Key Points

Evaluated 50 Indonesian Constitutional Court verdicts and 300 question-answer pairs.
Compared Long Context, Dense RAG, and Multi-Stage RAG using Gemini 2.5 Flash and GPT-4o Mini models.
Conducted a post-hoc audit on faithfulness evaluation based on human-verified evidence paragraphs.
No significant faithfulness difference between Long Context and Dense RAG across both model families (Gemini: d_z = 0.008, p = 0.43; GPT-4o Mini: d_z = 0.017, p = 0.49).
Dense RAG is 16-25x cheaper while maintaining comparable faithfulness and reducing non-response rates.
Ablation study indicates all components of Multi-Stage RAG decrease oracle faithfulness relative to Dense RAG.
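
The d_z values reported above are paired effect sizes. A minimal sketch of how such a value is conventionally computed follows, assuming the standard definition (mean of the per-question score differences divided by their sample standard deviation). The function name and the example scores are illustrative and are not taken from the study.

```python
import statistics


def paired_effect_size(scores_a, scores_b):
    """Cohen's d_z: mean of paired differences over their sample SD.

    Assumes scores_a[i] and scores_b[i] are faithfulness scores for the
    same question under two conditions (e.g. Long Context vs Dense RAG).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    if len(diffs) < 2:
        raise ValueError("need at least two pairs to estimate a standard deviation")
    sd = statistics.stdev(diffs)  # sample SD (n - 1 denominator)
    if sd == 0:
        raise ValueError("differences have zero variance; d_z is undefined")
    return statistics.mean(diffs) / sd


# Made-up scores, for illustration only.
condition_a = [2.0, 4.0, 6.0]
condition_b = [1.0, 2.0, 3.0]
print(paired_effect_size(condition_a, condition_b))
```

Values near zero, such as the 0.008 and 0.017 reported here, mean the average per-question difference is tiny relative to how much that difference varies from question to question.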

Abstract

The expansion of large language model (LLM) context windows raises a practical question for document-grounded question answering: if an entire source document fits into the prompt, is retrieval-augmented generation (RAG) still necessary? VerdictBench evaluates this question on 50 Indonesian Constitutional Court (Mahkamah Konstitusi, MK) verdicts and 300 human-reviewed question-answer pairs spanning four cognitive types. The study compares Long Context (LC), Dense RAG, and Multi-Stage RAG across Gemini 2.5 Flash and GPT-4o Mini. Dense RAG is the repository's historical Simple RAG condition: fixed-size chunking, dense embedding retrieval, and FAISS top-5 selection. Multi-Stage RAG is the historical Advanced RAG condition: query rewriting, metadata filtering, hybrid BM25+dense retrieval, and cross-encoder reranking. A post-hoc audit found that the original LC faithfulness evaluation used a 503-character logging preview rather than the full generation context, so this revision reports gold-evidence faithfulness, where all answers are judged against human-verified evidence paragraphs. Under this metric, LC and Dense RAG have no statistically significant Phase 2 faithfulness difference across both model families, with negligible paired effect sizes (Gemini: d_z = 0.008, p = 0.43; GPT-4o Mini: d_z = 0.017, p = 0.49). Both outperform Multi-Stage RAG. Dense RAG remains operationally preferable: it achieves comparable faithfulness at 16-25x lower cost and avoids the 56.7% long-verdict non-response rate observed for LC in Phase 1. The ablation study shows that every Multi-Stage RAG component reduces oracle faithfulness relative to the Dense RAG baseline. These results suggest that large context windows do not remove the need for retrieval in Indonesian legal QA; they shift the tradeoff from answer quality alone to cost, reliability, and evidence-selection control.
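
The Dense RAG condition described in the abstract (fixed-size chunking, dense embedding retrieval, top-5 selection) can be sketched roughly as below. This is not the study's implementation. The hashed bag-of-words `toy_embed` is a hypothetical stand-in for a real dense encoder, the brute-force cosine search stands in for a FAISS index, and the chunk size and all function names are assumptions chosen for illustration.

```python
import hashlib
import math


def chunk_text(text, chunk_size=200):
    """Split text into consecutive fixed-size character chunks."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def toy_embed(text, dim=64):
    """Hypothetical stand-in for a dense encoder: hashed bag of words, L2-normalized."""
    vec = [0.0] * dim
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def top_k_chunks(question, chunks, k=5):
    """Return the k chunks most similar to the question by cosine similarity.

    Brute-force search; a vector index such as FAISS would replace this loop
    in a real system.
    """
    q = toy_embed(question)
    scored = []
    for chunk in chunks:
        c = toy_embed(chunk)
        scored.append((sum(a * b for a, b in zip(q, c)), chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


# Illustrative usage with placeholder text, not a real verdict.
verdict_text = "the court rejected the petition " * 40
chunks = chunk_text(verdict_text, chunk_size=200)
context = top_k_chunks("why was the petition rejected", chunks, k=5)
print(len(chunks), len(context))
```

Only the selected chunks are sent to the model instead of the full verdict, which is the source of the cost gap the abstract reports between Dense RAG and Long Context.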
