What question did this study set out to answer?

This study asks whether an LLM-based evaluator of machine-generated text can be made more reliable by enlarging its training data through data augmentation and fine-tuning it with the odds ratio preference optimization (ORPO) algorithm.

April 1, 2026 · Open Access

ISLab at the NTCIR-18 AEOLLM: An Evaluator for Machine-Generated Text based on Data Augmentation and ORPO

Key Points

Leveraged data augmentation to increase training dataset size
Employed odds ratio preference optimization for evaluator fine-tuning
Utilized the NTCIR-18 AEOLLM dataset for training and testing
Evaluated model performance on summary generation and text expansion subtasks
Achieved an accuracy of 0.7658 on the summary generation subtask, the highest among all models evaluated
Secured the second-highest Kendall's tau and Spearman correlation on summary generation and text expansion subtasks
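The three metrics named above can be illustrated with a minimal sketch. It assumes that accuracy means pairwise agreement with human preferences (the page does not define it), uses the simple tau-a form of Kendall's tau without tie correction, and computes Spearman correlation as Pearson correlation on average ranks. The function names and the toy scores are hypothetical and are not taken from the AEOLLM evaluation scripts.

```python
from itertools import combinations


def _sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def pairwise_accuracy(human, predicted):
    """Share of item pairs (with distinct human scores) ordered the same way by the evaluator."""
    pairs = [(i, j) for i, j in combinations(range(len(human)), 2)
             if human[i] != human[j]]
    agree = sum(_sign(human[i] - human[j]) == _sign(predicted[i] - predicted[j])
                for i, j in pairs)
    return agree / len(pairs)


def kendall_tau(human, predicted):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    n = len(human)
    s = sum(_sign(human[i] - human[j]) * _sign(predicted[i] - predicted[j])
            for i, j in combinations(range(n), 2))
    return s / (n * (n - 1) / 2)


def _ranks(values):
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(human, predicted):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rh, rp = _ranks(human), _ranks(predicted)
    n = len(rh)
    mh, mp = sum(rh) / n, sum(rp) / n
    cov = sum((a - mh) * (b - mp) for a, b in zip(rh, rp))
    var_h = sum((a - mh) ** 2 for a in rh)
    var_p = sum((b - mp) ** 2 for b in rp)
    return cov / (var_h * var_p) ** 0.5


# Toy example: hypothetical human scores and evaluator scores for four outputs.
human_scores = [1, 2, 3, 4]
evaluator_scores = [1, 2, 4, 3]
print(pairwise_accuracy(human_scores, evaluator_scores))
print(kendall_tau(human_scores, evaluator_scores))
print(spearman(human_scores, evaluator_scores))
```

In this sketch all three metrics compare the evaluator's ordering with the human ordering, so a single swapped pair lowers each of them. They weight the error differently, which is one reason a system can rank first on one metric and second on another.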

Abstract

In recent years, large language models (LLMs) have been widely applied to various natural language processing (NLP) tasks, demonstrating exceptional performance. To evaluate the output quality of these LLMs, numerous studies utilize one LLM as an evaluator to assess the quality of outputs from other LLMs, showing promising results on public benchmarks. However, the performance of LLMs as evaluators on many unpublished benchmarks still needs improvement. To achieve better evaluation performance, some studies have attempted to fine-tune evaluators based on large amounts of data, incurring significant manual costs and posing substantial limitations in practical applications. Therefore, this paper leverages data augmentation to increase the volume of training data and employs the odds ratio preference optimization (ORPO) algorithm for reinforcement learning to optimize the evaluator. This study uses the dataset provided by NTCIR-18’s Automatic Evaluation of LLMs (AEOLLM) task for training and testing. The proposed method achieves an accuracy of 0.7658 on the summary generation subtask of AEOLLM, the highest among all compared models. Additionally, it yields the second-highest performance in both Kendall’s tau and Spearman correlation coefficient on the summary generation and text expansion subtasks among all compared models.
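As an illustration of the ORPO objective mentioned in the abstract, the sketch below computes the loss for one preference pair from per-token log-probabilities. It follows the general form of ORPO (a supervised negative log-likelihood term on the preferred response plus a weighted odds-ratio term) and is not the authors' training code. The function names, the toy log-probabilities, and the weight `lam=0.1` are assumptions chosen for illustration.

```python
import math


def log_odds(avg_logp):
    """log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def orpo_loss(chosen_logps, rejected_logps, lam=0.1):
    """ORPO-style loss for one (chosen, rejected) response pair.

    chosen_logps / rejected_logps: per-token log-probabilities that the model
    assigns to the preferred and the dispreferred response.
    lam: weight of the odds-ratio term (hypothetical value).
    """
    # Length-normalized log-likelihood of each response.
    avg_chosen = sum(chosen_logps) / len(chosen_logps)
    avg_rejected = sum(rejected_logps) / len(rejected_logps)

    # Supervised term: negative log-likelihood of the preferred response.
    sft_term = -avg_chosen

    # Odds-ratio term: -log(sigmoid(log-odds ratio)), written as softplus(-ratio).
    log_odds_ratio = log_odds(avg_chosen) - log_odds(avg_rejected)
    or_term = softplus(-log_odds_ratio)

    return sft_term + lam * or_term


# Toy example with hypothetical token log-probabilities.
print(orpo_loss([-0.4, -0.6], [-1.8, -2.2]))
```

The odds-ratio term shrinks as the model assigns relatively higher odds to the preferred response than to the rejected one, so the two terms can be optimized together in a single fine-tuning stage without a separate reference model.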
