Abstract

Large language models (LLMs) are at the core of many Artificial Intelligence (AI) systems. A key problem with these systems is hallucination, i.e., the generation of fabricated facts. Retrieval-Augmented Generation (RAG) mitigates this problem by grounding responses in external knowledge sources, thereby improving their factual accuracy. A RAG system consists of two core components: an information retrieval component (a retriever and a reranker) and a text generation component (an LLM). Its efficacy therefore depends on the retrieval strategy, the reranking mechanism, and the generation model. In this study, we conduct a systematic evaluation of nine retriever–reranker configurations, formed by pairing three retrievers (Fusion, HyDE, and HyPE) with three rerankers (BGE, MiniLM, and GPT-4o-mini), within a controlled RAG framework. Our analysis extends beyond traditional retrieval metrics: alongside Mean Reciprocal Rank (MRR), we evaluate generation correctness, faithfulness, relevance, cost, and latency. Results show that LLM-based reranking consistently improves downstream generation quality. The HyPE + GPT-4o-mini configuration achieves the highest overall performance, with correctness and relevance scores of 0.8012 and 0.9267, respectively, and is the only configuration with a positive MRR gain. Cross-encoder rerankers offer lower latency and cost, but they exhibit a measurable decline in answer quality.
Elkiran et al. (Sat,) studied this question.
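To make the evaluation setup concrete, the following minimal Python sketch enumerates the nine retriever–reranker pairings named in the abstract and computes Mean Reciprocal Rank using its standard definition (the mean, over queries, of the reciprocal rank of the first relevant document). The function names, the document identifiers, and the reading of "MRR gain" as the MRR after reranking minus the MRR before reranking are assumptions made for illustration; the abstract does not specify the study's actual implementation.

```python
from itertools import product

# Component names taken from the abstract; the pairing logic is illustrative.
RETRIEVERS = ["Fusion", "HyDE", "HyPE"]
RERANKERS = ["BGE", "MiniLM", "GPT-4o-mini"]


def configurations():
    """Return every (retriever, reranker) pairing: 3 x 3 = 9 configurations."""
    return list(product(RETRIEVERS, RERANKERS))


def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Standard MRR: average over queries of 1 / rank of the first relevant hit.

    ranked_lists:  one ranked list of document ids per query.
    relevant_sets: one set of relevant document ids per query.
    A query with no relevant document in its ranking contributes 0.
    """
    if len(ranked_lists) != len(relevant_sets):
        raise ValueError("ranked_lists and relevant_sets must have equal length")
    if not ranked_lists:
        return 0.0
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant document counts
    return total / len(ranked_lists)


def mrr_gain(before, after, relevant_sets):
    """Assumed definition of MRR gain: MRR after reranking minus MRR before."""
    return (mean_reciprocal_rank(after, relevant_sets)
            - mean_reciprocal_rank(before, relevant_sets))


if __name__ == "__main__":
    for retriever, reranker in configurations():
        print(f"{retriever} + {reranker}")

    # Hypothetical toy data: two queries, with rankings before and after reranking.
    relevant = [{"d2"}, {"d7"}]
    before = [["d1", "d2", "d3"], ["d7", "d8", "d9"]]
    after = [["d2", "d1", "d3"], ["d8", "d7", "d9"]]
    print("MRR before:", mean_reciprocal_rank(before, relevant))
    print("MRR after: ", mean_reciprocal_rank(after, relevant))
    print("MRR gain:  ", mrr_gain(before, after, relevant))
```

Under this assumed definition, a negative gain means the reranker moved the first relevant document further down the list on average. That is compatible with the reported result that only one configuration achieved a positive MRR gain while generation quality still improved under LLM-based reranking.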