Abstract With the development of generative artificial intelligence (AI) and the widespread adoption of large language models (LLMs) across many fields, objectively evaluating the quality of such AI systems has become an important task. Traditional machine learning metrics turn out to be inapplicable, since the responses of LLM-based solutions vary widely in wording while remaining semantically correct. This paper analyzes existing approaches to evaluating the quality of systems built on generative AI, including lexical methods (term frequency–inverse document frequency, TF-IDF, and Best Matching 25, BM25), semantic embeddings, hybrid approaches based on LLM-as-a-Judge, and natural language inference (NLI) methods. Particular attention is paid to the development of an algorithm for selecting the optimal evaluation strategy depending on the requirements of the task, including evaluation latency, the correctness and interpretability of the results, and the stability and reproducibility of the evaluation. For comparison, the paper reports the results of the various evaluation methods on the task of assessing the accuracy and relevance of an AI system's responses over a set of 500 test examples; the correlation with expert assessments ranges from 0.67 to 0.92, depending on the chosen approach. The proposed algorithm can be used to build a suitable evaluation process for AI systems in various domains.
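To make the limitation of purely lexical scoring concrete, the following is a minimal sketch of TF-IDF cosine similarity in plain Python. It is not the paper's implementation: the three example sentences, the helper names (`tokenize`, `tfidf_vectors`, `cosine`), and the smoothed IDF formula fitted on just these three strings are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight dict) per document.

    Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1, so a term that
    occurs in every document still receives a positive weight.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical example data: a reference answer, a correct paraphrase,
# and a factually wrong answer that reuses most of the reference wording.
reference = "The capital of France is Paris."
paraphrase = "Paris serves as France's capital city."
wrong = "The capital of France is Lyon."

ref_vec, para_vec, wrong_vec = tfidf_vectors([reference, paraphrase, wrong])
paraphrase_score = cosine(ref_vec, para_vec)
wrong_score = cosine(ref_vec, wrong_vec)

print(f"correct paraphrase vs reference: {paraphrase_score:.2f}")
print(f"wrong answer vs reference:       {wrong_score:.2f}")
```

In this toy case the factually wrong answer scores higher than the correct paraphrase, because it shares more surface tokens with the reference. This is the kind of mismatch that motivates the semantic, NLI, and LLM-as-a-Judge approaches listed in the abstract.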
Aleksandr Meshkov studied this question.