Abstract. This study addresses the limitations of traditional benchmarking methods for Retrieval-Augmented Generation (RAG) systems by proposing an evaluation framework for RAG-enhanced Large Language Models (LLMs). The framework structures evaluation dimensions and metrics, identifies suitable datasets and question types, and provides guidance for applying it in practice. A systematic literature review (SLR) synthesizes evidence from 12 studies on the evaluation of RAG systems. The review uses a concept matrix to classify evaluation approaches and maps metrics to dimensions, evaluator types, and pipeline stages. It also systematizes the dataset and question-type requirements that make the proposed measurements possible and derives implementable evaluation guidance. The findings reveal substantial variation in evaluation practices and underscore the need for a multidimensional view. The framework integrates five dimensions (context relevance, faithfulness, answer relevance, correctness, and citation quality) with corresponding metrics and links them to dataset prerequisites. It further outlines how the framework can be adapted to different RAG pipeline configurations for use in real-world evaluation settings. By consolidating dimensions, metrics, evaluators, and dataset requirements into a coherent structure, the framework supports more systematic and transparent RAG evaluation design. It offers actionable recommendations for selecting and operationalizing metrics and for integrating evaluation into RAG pipelines, which aids the assessment and deployment of RAG-enhanced LLMs in dynamic environments.
Knollmeyer et al. studied this question.
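
The concept matrix described in the abstract, which maps metrics to dimensions, evaluator types, pipeline stages, and dataset prerequisites, could be represented roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: only the five dimension names come from the abstract. The metric names, evaluator labels (`llm_judge`, `lexical`), stage labels (`retrieval`, `generation`), and dataset field names are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass

# The five evaluation dimensions named in the abstract.
DIMENSIONS = (
    "context_relevance",
    "faithfulness",
    "answer_relevance",
    "correctness",
    "citation_quality",
)


@dataclass(frozen=True)
class MetricEntry:
    """One row of a concept matrix (all field values below are illustrative)."""
    name: str            # hypothetical metric label
    dimension: str       # must be one of DIMENSIONS
    evaluator: str       # assumed evaluator type, e.g. "llm_judge" or "lexical"
    stage: str           # assumed pipeline stage, e.g. "retrieval" or "generation"
    requires: frozenset  # dataset fields the metric needs in order to be computed

    def __post_init__(self):
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension}")


# Illustrative rows only; these are NOT the mappings reported in the study.
CONCEPT_MATRIX = [
    MetricEntry("context_precision", "context_relevance", "llm_judge", "retrieval",
                frozenset({"question", "retrieved_contexts"})),
    MetricEntry("groundedness", "faithfulness", "llm_judge", "generation",
                frozenset({"answer", "retrieved_contexts"})),
    MetricEntry("answer_relevance_score", "answer_relevance", "llm_judge", "generation",
                frozenset({"question", "answer"})),
    MetricEntry("exact_match", "correctness", "lexical", "generation",
                frozenset({"answer", "reference_answer"})),
    MetricEntry("citation_recall", "citation_quality", "llm_judge", "generation",
                frozenset({"answer", "citations", "retrieved_contexts"})),
]


def applicable_metrics(available_fields, matrix=CONCEPT_MATRIX):
    """Return the metrics whose dataset prerequisites are all available."""
    available = set(available_fields)
    return [m for m in matrix if m.requires <= available]


def uncovered_dimensions(available_fields, matrix=CONCEPT_MATRIX):
    """Return the dimensions that no applicable metric covers for this dataset."""
    covered = {m.dimension for m in applicable_metrics(available_fields, matrix)}
    return [d for d in DIMENSIONS if d not in covered]
```

Under these assumed rows, a dataset that provides only questions, answers, and retrieved contexts would leave `correctness` and `citation_quality` uncovered, which illustrates how dataset prerequisites constrain which dimensions can be measured.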