July 1, 2026 | Open Access

Comparative analysis of evaluation methods for generative artificial intelligence systems and development of selection algorithm

Key Points

The study aims to identify effective methods for evaluating generative AI systems and to develop an algorithm for selecting among them.
It analyzes existing evaluation approaches for generative AI, including lexical methods, semantic embeddings, LLM-as-a-Judge hybrids, and natural language inference (NLI).
It develops an algorithm for selecting an evaluation strategy based on latency, correctness, interpretability, stability, and reproducibility.
The methods are compared on a set of 500 test examples by how well their scores correlate with expert assessments.
Correlation with expert assessments ranges from 0.67 to 0.92, depending on the method.
Traditional machine learning metrics prove unsuitable for LLM responses, whose wording varies widely even when the meaning is correct.
The proposed algorithm helps choose an evaluation method that fits the task's requirements.
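
To make the idea of "correlation with expert assessments" concrete, the following is a minimal sketch, not the paper's implementation. It scores each response against a reference answer with a simple TF-IDF cosine similarity (one of the lexical methods named above) and then computes the Pearson correlation between those scores and expert scores. The three examples, the expert scores, the choice of Pearson correlation, and the helper names `tfidf_vectors`, `cosine`, and `pearson` are all illustrative assumptions. Fitting the IDF weights on each reference/response pair alone is a simplification.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build TF-IDF vectors (as dicts) for a list of whitespace-tokenized docs."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of docs in which each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF, so terms present in every doc keep a non-zero weight.
        vectors.append({
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: (reference answer, system response, expert score in [0, 1]).
examples = [
    ("paris is the capital of france", "the capital of france is paris", 1.0),
    ("water boils at 100 degrees celsius", "water boils at 90 degrees celsius", 0.3),
    ("the sun is a star", "bananas are yellow", 0.0),
]

metric_scores = []
for reference, response, _ in examples:
    ref_vec, resp_vec = tfidf_vectors([reference, response])
    metric_scores.append(cosine(ref_vec, resp_vec))

expert_scores = [score for _, _, score in examples]
correlation = pearson(metric_scores, expert_scores)
```

In this toy data the second response is factually wrong but shares most of its words with the reference, so the lexical score stays high while the hypothetical expert score is low. That gap is the kind of disagreement a correlation study over a larger test set is meant to expose.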

Abstract

With the development of generative artificial intelligence (AI) and the widespread adoption of large language models (LLMs) across many fields, objectively evaluating the quality of such AI systems has become an important task. Traditional machine learning metrics turn out to be inapplicable, since the responses of LLM-based solutions vary widely in wording while remaining semantically correct. This paper analyzes existing approaches to evaluating the quality of systems built on generative AI: lexical methods (term frequency–inverse document frequency, TF-IDF, and Best Matching 25, BM25), semantic embeddings, hybrid approaches based on LLM-as-a-Judge, and natural language inference (NLI) methods. Particular attention is paid to developing an algorithm for selecting the optimal evaluation strategy depending on the requirements of the task, including evaluation latency, the correctness and interpretability of the results, and their stability and reproducibility. For comparison, the paper presents results for the different evaluation methods on the task of assessing the accuracy and relevance of an AI system's responses over a set of 500 test examples; correlation with expert assessments ranges from 0.67 to 0.92, depending on the chosen approach. The proposed algorithm can be used to build a suitable evaluation process for AI systems in various domains.
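
The abstract does not spell out the selection algorithm itself, so the following is only a hypothetical sketch of what criterion-based selection could look like, not the authors' method. It filters candidate methods by a latency budget and then ranks the rest by a weighted sum of correctness, interpretability, and reproducibility. The `MethodProfile` type, the `select_method` function, the weights, and every number in `PROFILES` are placeholders invented for illustration; none of them are values reported in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MethodProfile:
    name: str
    latency_ms: float        # typical time to score one response (placeholder)
    correctness: float       # agreement with experts, 0..1 (placeholder)
    interpretability: float  # 0..1 (placeholder)
    reproducibility: float   # 0..1 (placeholder)


# Placeholder profiles for illustration only, NOT results from the paper.
PROFILES = [
    MethodProfile("lexical (TF-IDF/BM25)", 5, 0.5, 0.9, 1.0),
    MethodProfile("semantic embeddings", 50, 0.7, 0.5, 0.9),
    MethodProfile("NLI", 200, 0.8, 0.7, 0.9),
    MethodProfile("LLM-as-a-Judge", 2000, 0.9, 0.6, 0.6),
]


def select_method(profiles, max_latency_ms, weights):
    """Return the best-scoring profile within the latency budget, or None."""
    candidates = [p for p in profiles if p.latency_ms <= max_latency_ms]
    if not candidates:
        return None

    def score(p):
        # Weighted sum of the non-latency criteria.
        return (weights["correctness"] * p.correctness
                + weights["interpretability"] * p.interpretability
                + weights["reproducibility"] * p.reproducibility)

    return max(candidates, key=score)


# Hypothetical task requirements: correctness matters most.
weights = {"correctness": 0.6, "interpretability": 0.2, "reproducibility": 0.2}
realtime_choice = select_method(PROFILES, max_latency_ms=100, weights=weights)
offline_choice = select_method(PROFILES, max_latency_ms=5000, weights=weights)
```

Treating latency as a hard constraint and the other criteria as weighted preferences is one possible design; a real deployment would need measured profiles for its own data and domain.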
