What question did this study set out to answer?

This research aims to assess the effectiveness of various retriever-reranker setups in Retrieval-Augmented Generation systems, focusing on both quality and efficiency.

May 11, 2026. Open Access.

Evaluating retriever reranker pairings in RAG based on quality and efficiency trade-offs

Harun Elkiran (İstanbul Sabahattin Zaim Üniversitesi), Jawad Rasheed (İstanbul Sabahattin Zaim Üniversitesi)

Key Points

Systematic evaluation of 9 retriever-reranker configurations using a controlled RAG framework.
Assessment metrics include Mean Reciprocal Rank (MRR), generation correctness, faithfulness, relevance, cost, and latency.
Three retrievers (Fusion, HyDE, HyPE) and three rerankers (BGE, MiniLM, GPT-4o-mini) were tested.
The HyPE + GPT-4o-mini configuration achieved the highest correctness (0.8012) and relevance (0.9267) scores and was the only configuration with a positive MRR gain.
LLM-based reranking consistently improved downstream generation quality.
Cross-encoder rerankers displayed lower latency and cost, but at the expense of answer quality.
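
The 9 configurations in the key points above are the full 3 x 3 pairing of the named retrievers and rerankers. A minimal sketch of that grid follows; the variable names are illustrative assumptions and do not come from the paper's code.

```python
from itertools import product

# Component names as listed in the summary; the variable names are hypothetical.
RETRIEVERS = ("Fusion", "HyDE", "HyPE")
RERANKERS = ("BGE", "MiniLM", "GPT-4o-mini")

# Pair every retriever with every reranker: 3 x 3 = 9 configurations.
configurations = [
    f"{retriever} + {reranker}"
    for retriever, reranker in product(RETRIEVERS, RERANKERS)
]

for name in configurations:
    print(name)
```

Each entry would then be run through the same generation and scoring pipeline, so that differences in the metrics can be attributed to the retriever-reranker pairing rather than to other settings.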

Abstract

Large language models (LLMs) are at the core of many Artificial Intelligence (AI) systems. A key problem with these systems is hallucination (i.e., making up facts). Retrieval-Augmented Generation (RAG) addresses this problem by grounding responses in external knowledge sources, thereby improving their factual accuracy. A RAG system consists of two core components: the information retrieval component (retriever and reranker) and the text generation component (LLM). The efficacy of a RAG system therefore depends on its retrieval strategy, reranking mechanism, and generation model. In this study, we conduct a systematic evaluation of 9 retriever–reranker configurations, pairing 3 retrievers (Fusion, HyDE, and HyPE) with 3 rerankers (BGE, MiniLM, and GPT-4o-mini), within a controlled RAG framework. Our analysis extends beyond traditional retrieval metrics by evaluating Mean Reciprocal Rank (MRR), generation correctness, faithfulness, relevance, cost, and latency. Results show that LLM-based reranking consistently improves downstream generation quality: the HyPE + GPT-4o-mini configuration achieves the highest overall performance, with correctness and relevance scores of 0.8012 and 0.9267, respectively, and the only positive MRR gain. Cross-encoder rerankers offer lower latency and cost but exhibit a measurable decline in answer quality.
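
For readers unfamiliar with the retrieval metric, Mean Reciprocal Rank averages, over all queries, the reciprocal of the rank at which the first relevant document appears. The sketch below uses the standard definition with made-up toy rankings. It assumes one relevant document per query and assumes that "MRR gain" means MRR after reranking minus MRR before reranking; the summary does not spell out either detail.

```python
from typing import Hashable, Sequence


def reciprocal_rank(ranked_ids: Sequence[Hashable], relevant_id: Hashable) -> float:
    """Return 1/rank (1-based) of the relevant document, or 0.0 if it is missing."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[Hashable]],
    relevant_ids: Sequence[Hashable],
) -> float:
    """Average the reciprocal rank over all queries."""
    if len(rankings) != len(relevant_ids):
        raise ValueError("need exactly one relevant id per ranking")
    if not rankings:
        return 0.0
    total = sum(reciprocal_rank(r, rel) for r, rel in zip(rankings, relevant_ids))
    return total / len(rankings)


# Toy data (hypothetical, not from the paper): two queries, one relevant doc each.
relevant = ["d1", "d4"]
before_rerank = [["d3", "d1", "d2"], ["d5", "d4", "d6"]]
after_rerank = [["d1", "d3", "d2"], ["d5", "d4", "d6"]]

mrr_before = mean_reciprocal_rank(before_rerank, relevant)
mrr_after = mean_reciprocal_rank(after_rerank, relevant)

# Assumed definition of "MRR gain": change in MRR caused by the reranker.
mrr_gain = mrr_after - mrr_before
print(mrr_before, mrr_after, mrr_gain)
```

Under this assumed definition, a positive gain means the reranker moved relevant documents closer to the top of the list, and a negative gain means it pushed them further down.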
