We present a reproducible retrieval benchmark for Kazakh — an agglutinative, low-resource language — comprising 300 queries over 8,370 Wikipedia passages across three query categories: inflected, natural, and vocabulary-gap. We evaluate five retrieval systems: BM25 with and without a Kazakh stemmer, and three zero-shot dense models (LaBSE, Granite-278M, E5-base). The Kazakh stemmer significantly improves BM25 retrieval (+16% nDCG@10 on inflected queries, p=0.0017; +9% overall, p=0.0001, n=300, paired bootstrap), and outperforms zero-shot LaBSE (0.754 vs 0.481). We also report three honest negative results: synonym query expansion hurts all categories, RRF hybrid fusion fails its pre-registered criteria, and better retrieval does not yield a significant end-to-end RAG accuracy gain on Qwen2.5-7B. All code, data, and results are publicly available at https://github.com/Tim2190/Kaz-RAG-search-benchmark.
Timur Seidalin (Tue,) studied this question.