The expansion of large language model (LLM) context windows raises a practical question for document-grounded question answering: if an entire source document fits into the prompt, is retrieval-augmented generation (RAG) still necessary? VerdictBench evaluates this question on 50 Indonesian Constitutional Court (Mahkamah Konstitusi, MK) verdicts and 300 human-reviewed question-answer pairs spanning four cognitive types. The study compares Long Context (LC), Dense RAG, and Multi-Stage RAG across Gemini 2.5 Flash and GPT-4o Mini. Dense RAG is the repository's historical Simple RAG condition: fixed-size chunking, dense embedding retrieval, and FAISS top-5 selection. Multi-Stage RAG is the historical Advanced RAG condition: query rewriting, metadata filtering, hybrid BM25+dense retrieval, and cross-encoder reranking. A post-hoc audit found that the original LC faithfulness evaluation used a 503-character logging preview rather than the full generation context, so this revision reports gold-evidence faithfulness, where all answers are judged against human-verified evidence paragraphs. Under this metric, LC and Dense RAG have no statistically significant Phase 2 faithfulness difference across both model families, with negligible paired effect sizes (Gemini: dᵦ = 0.008, p = 0.43; GPT-4o Mini: dᵦ = 0.017, p = 0.49). Both outperform Multi-Stage RAG. Dense RAG remains operationally preferable: it achieves comparable faithfulness at 16-25x lower cost and avoids the 56.7% long-verdict non-response rate observed for LC in Phase 1. The ablation study shows that every Multi-Stage RAG component reduces oracle faithfulness relative to the Dense RAG baseline. These results suggest that large context windows do not remove the need for retrieval in Indonesian legal QA; they shift the tradeoff from answer quality alone to cost, reliability, and evidence-selection control.
Muhammad Iqbal Hilmy Izzulhaq studied this question.
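To make the paired effect sizes reported above concrete, the following is a minimal sketch of one common way such a statistic can be computed: the mean of the per-question score differences divided by their sample standard deviation. The text does not define dᵦ, so treating it as this standardized paired difference is an assumption. The function name `paired_effect_size` and the score lists are hypothetical and are not data from the study.

```python
from statistics import mean, stdev


def paired_effect_size(scores_a, scores_b):
    """Standardized mean of paired differences: mean(diff) / sd(diff).

    Assumption: this is one common definition of a paired effect size;
    the source does not state which formula it uses.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all paired differences are identical; effect size undefined")
    return mean(diffs) / sd


# Hypothetical per-question faithfulness scores for the same six questions
# answered under two conditions. These are illustrative values only.
lc_scores = [0.9, 0.8, 1.0, 0.7, 0.85, 0.95]
dense_rag_scores = [0.85, 0.85, 0.95, 0.75, 0.85, 0.9]

d = paired_effect_size(lc_scores, dense_rag_scores)
print(f"paired effect size: {d:.3f}")
```

Because the statistic is computed on per-question differences, it is only meaningful when both conditions answer the same questions, which is the setting the comparison between LC and Dense RAG describes.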